Is HDFS necessary for Spark workloads? - hadoop

HDFS is not necessary but recommendations appear in some places.
To help evaluate the effort spent in getting HDFS running:
What are the benefits of using HDFS for Spark workloads?

Spark is a distributed processing engine and HDFS is a distributed storage system.
If HDFS is not an option, then Spark has to use some other alternative in form of Apache Cassandra Or Amazon S3.
Have a look at this comparision
S3 – Non urgent batch jobs. S3 fits very specific use cases, when data locality isn’t critical.
Cassandra – Perfect for streaming data analysis and an overkill for batch jobs.
HDFS – Great fit for batch jobs without compromising on data locality.
When to use HDFS as storage engine for Spark distributed processing?
If you have big Hadoop cluster already in place and looking for real time analytics of your data, Spark can use existing Hadoop cluster. It will reduce development time.
Spark is in-memory computing engine. Since data can't fit into memory always, data has to be spilled to disk for some operations. Spark will benifit from HDFS in this case. The Teragen sorting record achieved by Spark used HDFS storage for sorting operation.
HDFS is scalable, reliable and fault tolerant distributed file system ( since Hadoop 2.x release). With data locality principle, processing speed is improved.
Best for Batch-processing jobs.

The shortest answer is:"No, you don't need it". You can analyse data even without HDFS, but off course you need to replicate the data on all your nodes.
The long answer is quite counterintuitive and i'm still tryng to understand it with the help stackoverflow community.
Spark local vs hdfs permormance

HDFS (or any distributed Filesystems) makes distributing your data much simpler. Using a local filesystem you would have to partition/copy the data by hand to the individual nodes and be aware of the data distribution when running your jobs. In addition HDFS also handles failing nodes failures.
From an integration between Spark and HDFS, you can imagine spark knowing about the data distribution so it will try to schedule tasks to the same nodes where the required data resides.
Second: which problems did you face exactly with the instruction?
BTW: if you are just looking for an easy setup on AWS, DCOS allows you to install HDFS with a single command...

So you could go with Cloudera or Hortenworks distro and load up an entire stack very easily. CDH will be used with YARN though I find it so much more difficult to configure mesos in CDH. Horten is much easier to customize.
HDFS is great because of datanodes = data locality (process where the data is) as shuffling/data transfer is very expensive. HDFS also naturally blocks files which allows Spark to partition on the blocks. (128mb blocks, you can change this).
You could use S3 and Redshift.
See here:
https://github.com/databricks/spark-redshift

Related

Alluxio with/without HDFS

I have a cluster with HDFS as an under storage distributed file system, but I've just read about alluxio that is fast and flexible. So, My question is: Should I use Alluxio with HDFS or Alluxio is alternative for HDFS? (I see in their site that shared storage for under storage file system can be network file system (NFS). So, I think HDFS is not required. Correct me if I make a mistake).
In which mode performance is better: HDFS with Alluxio or Alluxio stanalone (what I mean the term standalone is to be used alone in the cluster and not locally).
Reply from Alluxio maintainer.
First of all, Alluxio is not a replacement for HDFS. Instead, it is a new abstraction layer on top of other distributed/cloud storage systems including HDFS, S3, Azure Object Store and other possible choices. In your case, if you data is already in HDFS, you will perhaps still keep HDFS as the persistent data layer for Alluxio.
The typical scenarios users put Alluxio in the picture and see significant benefits include:
Your physical data is not located with your compute. E.g., your bigdata engine is reading data from S3 or other object storage. In this case, by deploying Alluxio with compute nodes, one can make Alluxio work as a filesystem level cache to avoid fetching data across network repeatedly. See http://www.alluxio.org/overview/remote-data-acceleration
You are managing multiple storages and want to expose a single data access layer to simplify the management. E.g., one can "mount" multiple S3/ buckets into one Alluxio deployment so they appear as different directories under the same namespace. See http://www.alluxio.org/overview/storage-unification
Regarding your original performance question. The answer is, it depends. If your HDFS is remote from compute, you would expect a good performance gain. I also saw cases when HDFS is bottlenecked, Alluxio may also help to reduce the load and provides good SLA for certain mission-critical jobs.

Cassandra vs HDFS to store analytics data

we have an Apache Spark cluster that analyse data stored in HDFS (.parquet).
The solution is optimal in terms of performance but it's not disaster safe as we would like, indeed, HDFS architecture has a single point of failure (the namenode) even using two namenode (you just have 2 point of failure but it's not enough).
To improve our cluster fault tolerance we would like to move to another data store solution like Cassandra.
Questions are:
With Cassandra as datastore is Spark able to leverage on DataLocality as it do with HDFS?
How this change can affect the performance?
Thanks
Matteo
There's article about data locality, spark and Cassandra, so yes, it is possible:
https://www.slideshare.net/SparkSummit/cassandra-and-spark-optimizing-russell-spitzer-1
I didn't done any performance checks with Spark on HDFS vs Cassandra, and i believe it will vary depending on different workflows, but since Netflix and Microsoft using Cassandra with Spark, i believe performance is acceptable in most cases, and probably is a trade-off between data ingestion speed, existence/nonexistence of ETL and speed of the analytical process.
About hadoop single point of failure - If you will run Cassandra with replication factor 3 and consistency level quorum, you will get same 2 nodes down that will make data unavailable :) , keep it in mind.
And maybe consider MapR hadoop distribution, they've tried to solve namenode problem.

Spark with HBASE vs Spark with HDFS

I know that HBASE is a columnar database that stores structured data of tables into HDFS by column instead of by row. I know that Spark can read/write from HDFS and that there is some HBASE-connector for Spark that can now also read-write HBASE tables.
Questions:
1) What are the added capabilities brought by layering Spark on top of HBASE instead of using HBASE solely? It depends only on programmer capabilities or is there any performance reason to do that? Are there things Spark can do and HBASE solely can't do?
2) Stemming from previous question, when you should add HBASE between HDFS and SPARK instead of using directly HDFS?
1) What are the added capabilities brought by layering Spark on top of
HBASE instead of using HBASE solely? It depends only on programmer
capabilities or is there any performance reason to do that? Are there
things Spark can do and HBASE solely can't do?
At Splice Machine, we use Spark for our analytics on top of HBase. HBase does not have an execution engine and spark provides a competent execution engine on top of HBase (Intermediate results, Relational Algebra, etc.). HBase is a MVCC storage structure and Spark is an execution engine. They are natural complements to one another.
2) Stemming from previous question, when you should add HBASE between
HDFS and SPARK instead of using directly HDFS?
Small reads, concurrent write/read patterns, incremental updates (most etl)
Good luck...
I'd say that using distributed computing engines like Apache Hadoop or Apache Spark imply basically a full scan of any data source. That's the whole point of processing the data all at once.
HBase is good at cherry-picking particular records, while HDFS certainly much more performant with full scans.
When you do a write to HBase from Hadoop or Spark, you won't write it to database is usual - it's hugely slow! Instead, you want to write the data to HFiles directly and then bulk import them into.
The reason people invent SQL databases is because HDDs were very very slow at that time. It took the most clever people tens of years to invent different kind of indexes to clever use the bottleneck resource (disk). Now people try to invent NoSQL - we like associative arrays and we need them be distributed (that's what essentially what NoSQL is) - they're very simple and very convenient. But in todays world with SSDs being cheap no one needs databases - file system is good enough in most cases. The one thing, though, is that it has to be distributed to keep up the distributed computations.
Answering original questions:
These are two different tools for completely different problems.
I think if you use Apache Spark for data analysis, you have to avoid HBase (Cassandra or any other database). They can be useful to keep aggregated data to build reports or picking specific records about users or items, but that's happen after the processing.
Hbase is a No SQL data base that works well to fetch your data in a fast fashion. Though it is a db, it used large number of Hfile(similar to HDFS files) to store your data and a low latency acces.
So use Hbase when it suits a requirement that your data needs to accessed by other big data.
Spark on the other hand, is the in-memory distributed computing engine which have connectivity to hdfs, hbase, hive, postgreSQL,json files,parquet files etc.
There is no considerable performance change while reading from a HDFS file or Hbase upto some gbs. After that Hbase connectivity is becoming faster....

use spark to copy data across hadoop cluster

I have a situation where I have to copy data/files from PROD to UAT (hadoop clusters). For that I am using 'distcp' now. but it is taking forever. As distcp uses map-reduce under the hood, is there any way to use spark to make the process any faster? Like we can set hive execution engine to 'TEZ' (to replace map-reduce), can we set execution engine to spark for distcp? Or is there any other 'spark' way to copy data across clusters which may not even bother about distcp?
And here comes my second question (assuming we can set distcp execution engine to spark instead of map-reduce, please don't bother to answer this one otherwise):-
As per my knowledge Spark is faster than map-reduce mainly because it stores data in the memory which it might need to process in several occasions so that it does not have to load the data all the way from disk. Here we are copying data across clusters, so there is no need to process one file (or block or split) more than once as each file will go up into the memory then will be sent over the network, gets copied to the destination cluster disk, end of the story for that file. Then how come Spark makes the process faster if the main feature is not used?
Your bottlenecks on bulk cross-cluster IO are usually
bandwidth between clusters
read bandwidth off the source cluster
write bandwidth to the destination cluster (and with 3x replication, writes do take up disk and switch bandwidth)
allocated space for work (i.e. number of executors, tasks)
Generally on long-distance uploads its your long-haul network that is the bottleneck: you don't need that many workers to flood the network.
There's a famous tale of a distcp operation between two Yahoo! clusters which did manage to do exactly that to part of the backbone: the Hadoop ops team happy that the distcp was going so fast, while the networks ops team are panicing that their core services were somehow suffering due to the traffic between two sites. I believe this incident is the reason that distcp now has a -bandwidth option :)
Where there may be limitations in distcp, it's probably in task setup and execution: the decision of which files to copy is made in advance and there's not much (any?) intelligence in rescheduling work if some files copy fast but others are outstanding.
Distcp just builds up the list in advance and hands it off to the special distcp mappers, each of which reads its list of files and copies it over.
Someone could try doing a spark version of distcp; it could be an interesting project if someone wanted to work on better scheduling, relying on the fact that spark is very efficient at pushing out new work to existing executors: a spark version could push out work dynamically, rather than listing everything in advance. Indeed, it could still start the copy operation while enumerating the files to copy, for a faster startup time. Even so: cross-cluster bandwidth will usually be the choke point.
Spark is not really intended for data movement between Hadoop clusters. You may want to look into additional mappers for your distcp job using the "-m" option.

MapReduce or Spark for Batch processing on Hadoop?

I know that MapReduce is a great framework for batch processing on Hadoop. But, Spark also can be used as batch framework on Hadoop that provides scalability, fault tolerance and high performance compared MapReduce. Cloudera, Hortonworks and MapR started supporting Spark on Hadoop with YARN as well.
But, a lot of companies are still using MapReduce Framework on Hadoop for batch processing instead of Spark.
So, I am trying to understand what are the current challenges of Spark to be used as batch processing framework on Hadoop?
Any thoughts?
Spark is an order of magnitude faster than mapreduce for iterative algorithms, since it gets a significant speedup from keeping intermediate data cached in the local JVM.
With Spark 1.1 which primarily includes a new shuffle implementation (sort-based shuffle instead of hash-based shuffle), a new network module (based on netty instead of using block manager for sending shuffle data), a new external shuffle service made Spark perform the fastest PetaByte sort (on 190 nodes with 46TB RAM) and TeraByte sort breaking Hadoop's old record.
Spark can easily handle the dataset which are order of magnitude larger than the cluster's aggregate memory. So, my thought is that Spark is heading in the right direction and will eventually get even better.
For reference this blog post explains how databricks performed the petabyte sort.
I'm assuming when you say Hadoop you mean HDFS.
There are number of benefits of using Spark over Hadoop MR.
Performance: Spark is at least as fast as Hadoop MR. For iterative algorithms (that need to perform number of iterations of the same dataset) is can be a few orders of magnitude faster. Map-reduce writes the output of each stage to HDFS.
1.1. Spark can cache (depending on the available memory) this intermediate results and therefore reduce latency due to disk IO.
1.2. Spark operations are lazy. This means Spark can perform certain optimizing before it starts processing the data because it can reorder operations because they have executed yet.
1.3. Spark keeps a lineage of operations and recreates the partial failed state based on this lineage in case of failure.
Unified Ecosystem: Spark provides a unified programming model for various types of analysis - batch (spark-core), interactive (REPL), streaming (spark-streaming), machine learning (mllib), graph processing (graphx), SQL queries (SparkSQL)
Richer and Simpler API: Spark's API is richer and simpler. Richer because it supports many more operations (e.g., groupBy, filter ...). Simpler because of the expressiveness of these functional constructs. Spark's API supports Java, Scala and Python (for most APIs). There is experimental support for R.
Multiple Datastore Support: Spark supports many data stores out of the box. You can use Spark to analyze data in a normal or distributed file system, HDFS, Amazon S3, Apache Cassandra, Apache Hive and ElasticSearch to name a few. I'm sure support for many other popular data stores is comings soon. This essentially if you want to adopt Spark you don't have to move your data around.
For example, here is what code for word count looks in Spark (Scala).
val textFile = sc.textFile("some file on HDFS")
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
I'm sure you have to write a few more lines if you are using standard Hadoop MR.
Here are some common misconceptions about Spark.
Spark is just a in-memory cluster computing framework. However, this is not true. Spark excels when your data can fit in memory because memory access latency is lower. But you can make it work even when your dataset doesn't completely fit in memory.
You need to learn Scala to use Spark. Spark is written in Scala and runs on the JVM. But the Spark provides support for most of the common APIs in Java and Python as well. So you can easily get started with Spark without knowing Scala.
Spark does not scale. Spark is for small datasets (GBs) only and doesn't scale to large number of machines or TBs of data. This is also not true. It has been used successfully to sort PetaBytes of data
Finally, if you do not have a legacy codebase in Hadoop MR it makes perfect sense to adopt Spark, the simple reason being all major Hadoop vendors are moving towards Spark for good reason.
Apache Spark runs in memory, making it much faster than mapreduce.
Spark started as a research project at Berkeley.
Mapreduce use disk extensively (for external sort, shuffle,..).
As the input size for a hadoop job is in order of terabytes. Spark memory requirements will be more than traditional hadoop.
So basically, for smaller jobs and with huge memory in ur cluster, sparks wins. And this is not practically the case for most clusters.
Refer to spark.apache.org for more details on spark

Resources