Cassandra vs HDFS to store analytics data - performance

We have an Apache Spark cluster that analyses data stored in HDFS (.parquet).
The solution performs well, but it is not as disaster-safe as we would like: the HDFS architecture has a single point of failure (the namenode), and even with two namenodes you just have two points of failure, which is not enough for us.
To improve our cluster fault tolerance we would like to move to another data store solution like Cassandra.
Questions are:
With Cassandra as the datastore, is Spark able to leverage data locality as it does with HDFS?
How can this change affect performance?
Thanks
Matteo

There's a presentation about data locality, Spark and Cassandra, so yes, it is possible:
https://www.slideshare.net/SparkSummit/cassandra-and-spark-optimizing-russell-spitzer-1
I haven't done any performance checks of Spark on HDFS vs Cassandra, and I believe it will vary between workflows. But since Netflix and Microsoft use Cassandra with Spark, I believe performance is acceptable in most cases, and it is probably a trade-off between data ingestion speed, the existence or absence of ETL, and the speed of the analytical process.
About the Hadoop single point of failure: if you run Cassandra with replication factor 3 and consistency level QUORUM, you get the same situation, since losing 2 of the 3 replicas of a piece of data makes it unavailable :) Keep it in mind.
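To make the quorum arithmetic above concrete, here is a minimal sketch. The function names are made up for illustration; the only fact it relies on is that a Cassandra QUORUM is floor(RF / 2) + 1 replicas.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor: int) -> int:
    """How many replicas of one piece of data can be down while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    for rf in (1, 3, 5):
        print(f"RF={rf}: quorum={quorum(rf)}, tolerated failures={tolerated_replica_failures(rf)}")
```

With RF=3 the quorum is 2, so one replica can be down; once a second replica of the same data is lost, QUORUM requests for that data fail.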
And maybe consider the MapR Hadoop distribution; they've tried to solve the namenode problem.

Related

Spark with HBASE vs Spark with HDFS

I know that HBase is a columnar database that stores the structured data of tables in HDFS by column instead of by row. I know that Spark can read from and write to HDFS, and that there is an HBase connector for Spark that can now also read and write HBase tables.
Questions:
1) What are the added capabilities brought by layering Spark on top of HBase instead of using HBase alone? Does it depend only on programmer capabilities or is there any performance reason to do that? Are there things Spark can do and HBase alone can't do?
2) Stemming from the previous question, when should you add HBase between HDFS and Spark instead of using HDFS directly?
1) What are the added capabilities brought by layering Spark on top of
HBase instead of using HBase alone? Does it depend only on programmer
capabilities or is there any performance reason to do that? Are there
things Spark can do and HBase alone can't do?
At Splice Machine, we use Spark for our analytics on top of HBase. HBase does not have an execution engine, and Spark provides a competent execution engine on top of HBase (intermediate results, relational algebra, etc.). HBase is an MVCC storage structure and Spark is an execution engine. They are natural complements to one another.
2) Stemming from the previous question, when should you add HBase between
HDFS and Spark instead of using HDFS directly?
Small reads, concurrent write/read patterns, incremental updates (most ETL).
Good luck...
I'd say that using distributed computing engines like Apache Hadoop or Apache Spark basically implies a full scan of the data source. That's the whole point of processing the data all at once.
HBase is good at cherry-picking particular records, while HDFS is certainly much more performant with full scans.
When you write to HBase from Hadoop or Spark, you don't want to write to the database the usual way: it's hugely slow! Instead, you want to write the data to HFiles directly and then bulk import them into HBase.
The reason people invented SQL databases is that HDDs were very, very slow at the time. It took the cleverest people decades to invent different kinds of indexes to make clever use of the bottleneck resource (disk). Now people try to invent NoSQL: we like associative arrays and we need them to be distributed (that's essentially what NoSQL is); they're very simple and very convenient. But in today's world, with SSDs being cheap, no one needs databases; a file system is good enough in most cases. The one thing, though, is that it has to be distributed to keep up with the distributed computations.
Answering original questions:
These are two different tools for completely different problems.
I think if you use Apache Spark for data analysis, you should avoid HBase (or Cassandra, or any other database). They can be useful for keeping aggregated data to build reports or for picking specific records about users or items, but that happens after the processing.
HBase is a NoSQL database that works well for fetching your data fast. Though it is a database, it uses a large number of HFiles (similar to HDFS files) to store your data and provide low-latency access.
So use HBase when your requirement is that the data needs to be accessed by other big data applications.
Spark, on the other hand, is an in-memory distributed computing engine that can connect to HDFS, HBase, Hive, PostgreSQL, JSON files, Parquet files, etc.
There is no considerable performance difference between reading from an HDFS file and reading from HBase up to a few GB. After that, HBase connectivity becomes faster...

Is HDFS necessary for Spark workloads?

HDFS is not necessary but recommendations appear in some places.
To help evaluate the effort spent in getting HDFS running:
What are the benefits of using HDFS for Spark workloads?
Spark is a distributed processing engine and HDFS is a distributed storage system.
If HDFS is not an option, then Spark has to use some other alternative such as Apache Cassandra or Amazon S3.
Have a look at this comparison:
S3 – Non urgent batch jobs. S3 fits very specific use cases, when data locality isn’t critical.
Cassandra – Perfect for streaming data analysis and an overkill for batch jobs.
HDFS – Great fit for batch jobs without compromising on data locality.
When to use HDFS as storage engine for Spark distributed processing?
If you already have a big Hadoop cluster in place and are looking for real-time analytics of your data, Spark can use the existing Hadoop cluster. It will reduce development time.
Spark is an in-memory computing engine. Since data can't always fit into memory, it has to be spilled to disk for some operations. Spark will benefit from HDFS in this case. The sorting benchmark record achieved by Spark (Daytona GraySort) used HDFS storage for the sort.
HDFS is a scalable, reliable and fault-tolerant distributed file system (since the Hadoop 2.x release). With the data locality principle, processing speed is improved.
Best for Batch-processing jobs.
The shortest answer is: "No, you don't need it". You can analyse data even without HDFS, but of course you need to replicate the data on all your nodes.
The long answer is quite counterintuitive and I'm still trying to understand it with the help of the Stack Overflow community.
Spark local vs hdfs performance
HDFS (or any distributed filesystem) makes distributing your data much simpler. Using a local filesystem, you would have to partition/copy the data by hand to the individual nodes and be aware of the data distribution when running your jobs. In addition, HDFS also handles node failures.
As for the integration between Spark and HDFS, you can imagine Spark knowing about the data distribution, so it will try to schedule tasks on the same nodes where the required data resides.
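A toy illustration of that idea, not Spark's actual scheduler (which has several locality levels and waits for local slots): given which nodes hold a replica of each block, prefer an executor on one of those nodes and fall back to any executor otherwise. All names and the node layout here are hypothetical.

```python
from typing import Dict, List


def assign_tasks(block_replicas: Dict[str, List[str]], executors: List[str]) -> Dict[str, str]:
    """Greedy, locality-aware assignment of one task per block.

    For each block, pick the least-loaded executor that holds a replica;
    if no executor holds one, pick the least-loaded executor overall.
    """
    load = {executor: 0 for executor in executors}
    assignment = {}
    for block, replicas in block_replicas.items():
        local = [node for node in replicas if node in load]
        candidates = local or executors  # fall back to a remote read
        chosen = min(candidates, key=lambda node: load[node])
        load[chosen] += 1
        assignment[block] = chosen
    return assignment


if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n2", "n3"], "b3": ["n9"]}
    print(assign_tasks(blocks, ["n1", "n2", "n3"]))
```

Here `b3` has no replica on any executor node, so its task runs wherever there is spare capacity and has to read the block over the network.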
Second: which problems did you face exactly with the instruction?
BTW: if you are just looking for an easy setup on AWS, DCOS allows you to install HDFS with a single command...
So you could go with the Cloudera or Hortonworks distro and load up an entire stack very easily. CDH will be used with YARN, though; I find it much more difficult to configure Mesos in CDH. Hortonworks is much easier to customize.
HDFS is great because datanodes give you data locality (process where the data is), as shuffling/data transfer is very expensive. HDFS also naturally splits files into blocks, which allows Spark to partition on the blocks (128 MB blocks by default; you can change this).
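As a rough back-of-the-envelope for the block point above: the number of HDFS blocks a file occupies is its size divided by the block size, rounded up. This sketch only does that arithmetic; the number of partitions Spark actually creates also depends on the input format and its settings, so treat the result as an approximation.

```python
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size mentioned above


def hdfs_block_count(file_size_bytes: int, block_size_bytes: int = DEFAULT_BLOCK_SIZE) -> int:
    """Number of blocks needed to store a file of the given size (ceiling division)."""
    if file_size_bytes <= 0:
        return 0
    return -(-file_size_bytes // block_size_bytes)


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(hdfs_block_count(one_gib))      # 8
    print(hdfs_block_count(one_gib + 1))  # 9
```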
You could use S3 and Redshift.
See here:
https://github.com/databricks/spark-redshift

Control data locality in Impala by partitioning

I would like to avoid Impala nodes unnecessarily requesting data from other nodes over the network in cases where the ideal data locality or layout is known at table creation time. This would be helpful with 'non-additive' operations where all records from a partition are needed in the same place (node) anyway (e.g. percentiles).
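To illustrate what I mean by 'non-additive', here is a small sketch with made-up numbers: a sum can be combined from per-node partial sums, but a median (the 50th percentile) cannot be combined from per-node medians, so all values have to end up in one place.

```python
from statistics import median


def sum_from_partials(partitions):
    """Additive: combine per-partition sums into the global sum."""
    return sum(sum(values) for values in partitions)


def median_of_partial_medians(partitions):
    """Naive (incorrect in general) attempt to combine per-partition medians."""
    return median([median(values) for values in partitions])


if __name__ == "__main__":
    # Hypothetical values held on two different nodes.
    partitions = [[1, 2, 3], [10, 20, 30, 40, 50]]
    all_values = [v for values in partitions for v in values]

    print(sum_from_partials(partitions), sum(all_values))        # the two sums agree
    print(median_of_partial_medians(partitions), median(all_values))  # the two medians differ
```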
Is it possible to tell Impala that all data in a partition should always be co-located on a single node for any HDFS replica?
In Impala SQL, I am not sure whether the "PARTITIONED BY" clause provides this feature. In my understanding, Impala chunks its partitions into separate files on HDFS, but HDFS does not guarantee the co-location of related files or blocks by default (rather, it tries to achieve the opposite).
I found some information about Impala's impact on HDFS development, but it is not clear whether these features are already implemented or still planned:
http://www.slideshare.net/deview/aaron-myers-hdfs-impala
(slides 23-24)
Thank you in advance.
About the slides you mention ("Co-located block replicas") - it's about an HDFS feature (HDFS-2576) implemented in Hadoop 2.1. It provides a Java API to give hints to HDFS as to where the blocks should be placed.
It's not used in Impala as of 2014, but it definitely looks like groundwork for that, as it would give Impala the performance equivalent of specifying a distribution key in traditional MPP databases.
No, that completely defeats the purpose of having a distributed file system and MPP computing. It also creates a single point of failure and a bottleneck especially if you're talking about a 250GB table that is joined to itself. Exactly the kind of problems that Hadoop was designed to solve. Partitioning data creates sub-directories in HDFS on the namenode and that data is then replicated throughout the datanodes in the cluster.

MySQL Cluster vs. Hadoop for handling big data

I want to know the advantages/disadvantages of using a MySQL Cluster and using the Hadoop framework.
Which is the better solution? I would like to read your opinion.
I think the advantages of using a MySQL Cluster are:
high availability
good scalability
high performance / real time data access
you can use commodity hardware
And I don't see a disadvantage! Are there any disadvantages that Hadoop does not have?
The advantages of Hadoop with Hive on top of it are:
also good scalability
you can also use commodity hardware
the ability to run in heterogeneous environments
parallel computing with the MapReduce framework
Hive with HiveQL
and the disadvantage is:
no real-time data access. It may take minutes or hours to analyze the data.
So in my opinion a MySQL cluster is the better solution for handling big data. Why is Hadoop the holy grail of handling big data? What is your opinion?
Both of the above answers miss a huge difference between MySQL and Hadoop. MySQL requires you to store data in a certain format. It likes heavily structured data: you declare the data type of each column in a table, etc. Hadoop doesn't care about this at all.
Example: if you have a billion text log files, to make analysis even possible for MySQL you'd need to parse and load the data into a MySQL table first, typing each column along the way. With Hadoop and MapReduce, you define the function that scans/analyzes/returns the data from its raw source; you don't need pre-processing ETL to get it pre-structured.
If the data is already structured and in MySQL, then (hopefully) it's well structured, so why export it for Hadoop to analyze? If it isn't, why spend the time to ETL the data?
Hadoop is not a replacement for MySQL, so I think each has its own scenario.
Everyone knows Hadoop is better for batch jobs or offline computation, but there are also many related real-time products, such as HBase.
If you want to choose an offline compute & storage architecture, I suggest Hadoop rather than MySQL Cluster, because of:
Cost: obviously, a Hadoop cluster is cheaper than a MySQL cluster
Scalability: Hadoop supports more than ten thousand machines in a cluster
Ecosystem: MapReduce, Hive, Pig, Sqoop, etc.
So you can choose Hadoop for offline compute & storage and MySQL for online compute & storage; you can also learn more from the lambda architecture.
The other answer is good, but doesn't really explain why Hadoop is more scalable for offline data crunching than MySQL Cluster. Hadoop is more efficient for large data sets that must be distributed across many machines because it gives you full control over the sharding of data.
MySQL Cluster uses auto-sharding, which is designed to distribute the data randomly so that no one machine gets hit with more of the load. Hadoop, on the other hand, allows you to explicitly define the data partitioning so that multiple data points that require simultaneous access will be on the same machine, minimizing the amount of communication among the machines necessary to get the job done. This makes Hadoop better for processing massive data sets in many cases.
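A small sketch of that difference, using made-up records and a simplified model: sharding on the row's primary key (here just `id mod n` as a stand-in for a hash of the key) spreads one user's rows over many machines, while partitioning explicitly on the user keeps them together. The function names are hypothetical and this is not how either system is actually implemented.

```python
import zlib
from collections import defaultdict


def shard_by_primary_key(records, n_machines):
    """Auto-sharding stand-in: the machine is derived from the row id alone."""
    return {rec_id: rec_id % n_machines for rec_id, _user in records}


def shard_by_user(records, n_machines):
    """Explicit partitioning: the machine is derived from the user the row belongs to."""
    return {rec_id: zlib.crc32(user.encode()) % n_machines for rec_id, user in records}


def machines_per_user(records, placement):
    """How many distinct machines hold rows of each user."""
    seen = defaultdict(set)
    for rec_id, user in records:
        seen[user].add(placement[rec_id])
    return {user: len(machines) for user, machines in seen.items()}


if __name__ == "__main__":
    # Twelve hypothetical rows alternating between two users.
    records = [(i, "alice" if i % 2 == 0 else "bob") for i in range(12)]
    print(machines_per_user(records, shard_by_primary_key(records, 3)))
    print(machines_per_user(records, shard_by_user(records, 3)))
```

With the key-based scheme every user's rows touch all three machines; with the user-based scheme each user's rows sit on exactly one machine, so a per-user computation needs no cross-machine traffic.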
The answer to this question has a good explanation of this distinction.

Hadoop on cassandra database

I am using Cassandra to store my data and Hive to process it.
I have 5 machines on which I have set up Cassandra and 2 machines I use as analytics nodes (where Hive runs).
So what I want to ask is: does Hive run MapReduce on just the two analytics nodes and bring the data there, or does it move the computation to the 5 Cassandra nodes as well and process the data on those machines? (What I know is that in Hadoop, the process moves to the data, not the data to the process.)
If you are interested in marrying Hadoop and Cassandra, the first link should be DataStax, a company built around this concept: http://www.datastax.com/
They built and support Hadoop with HDFS replaced by Cassandra.
To the best of my understanding, they do have data locality: http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/
There is a good answer about Hadoop & Cassandra data locality if you run MapReduce against Cassandra:
Cassandra and MapReduce - minimal setup requirements
Regarding your question, there is a trade-off:
a) If you run Hadoop/Hive on separate nodes, you lose data locality, and therefore your data throughput is limited by your network bandwidth.
b) If you run Hadoop/Hive on the same nodes as Cassandra, you get data locality, but the MapReduce processing behind Hive queries might clog your network (and other resources) and thereby affect the quality of service you get from Cassandra.
My suggestion would be to have separate Hive nodes if the performance of your Cassandra cluster is critical.
If your Cassandra cluster is mostly used as a data store and does not handle real-time requests, then running Hive on each node will improve performance and hardware utilization.
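To get a feel for option (a), here is a rough lower bound on how long it takes just to pull the data over the network into the analytics nodes. The data size and NIC speed are hypothetical, and the estimate ignores disk speed, protocol overhead and everything else, so real numbers will be worse.

```python
def min_transfer_seconds(data_bytes: int, reader_nodes: int, nic_bits_per_second: int) -> float:
    """Lower bound on the time to move data_bytes into reader_nodes machines,
    assuming each machine's network card is the only bottleneck."""
    aggregate_bits_per_second = reader_nodes * nic_bits_per_second
    return data_bytes * 8 / aggregate_bits_per_second


if __name__ == "__main__":
    one_terabyte = 10 ** 12     # hypothetical data set size
    gigabit = 10 ** 9           # hypothetical 1 Gbit/s network card per node
    seconds = min_transfer_seconds(one_terabyte, reader_nodes=2, nic_bits_per_second=gigabit)
    print(f"{seconds:.0f} s (about {seconds / 60:.0f} minutes)")
```

With two analytics nodes and these assumed numbers, a full scan of 1 TB cannot finish in less than about 4000 seconds, whatever the Cassandra side can deliver.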
