Hadoop on Cassandra database

I am using Cassandra to store my data and Hive to process it.
I have 5 machines on which I have set up Cassandra, and 2 machines I use as analytics nodes (where Hive runs).
So my question is: does Hive run MapReduce on just the two analytics nodes and pull the data over to them, or does it move the processing/computation to the 5 Cassandra nodes as well and compute on those machines? (What I know is that in Hadoop, the process moves to the data, not the data to the process.)

If you are interested in marrying Hadoop and Cassandra, the first place to look should be DataStax, a company built around exactly this concept: http://www.datastax.com/
They build and support a Hadoop distribution with HDFS replaced by Cassandra.
To the best of my understanding, they do have data locality: http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/
There is a good answer about Hadoop and Cassandra data locality if you run MapReduce against Cassandra:
Cassandra and MapReduce - minimal setup requirements
Regarding your question, there is a trade-off:
a) If you run Hadoop/Hive on separate nodes, you lose data locality, and therefore your data throughput is limited by your network bandwidth.
b) If you run Hadoop/Hive on the same nodes where Cassandra runs, you get data locality, but the MapReduce processing behind the Hive queries might clog your network (and other resources) and thereby affect Cassandra's quality of service.
My suggestion would be to use separate Hive nodes if the performance of your Cassandra cluster is critical.
If Cassandra is mostly used as a data store and does not handle real-time requests, then running Hive on every node will improve performance and hardware utilization.
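To make the trade-off concrete, here is a minimal sketch of an analytics job reading Cassandra. It uses Scala and the DataStax spark-cassandra-connector rather than Hive (my substitution, since Brisk itself is long gone), and the connection host, keyspace, table and column names are all made up:

```scala
// Sketch only: assumes the DataStax spark-cassandra-connector is on the classpath;
// "sensor_data" / "readings" / "sensor_id" are invented names.
import org.apache.spark.sql.SparkSession

object CassandraAnalyticsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cassandra-analytics-sketch")
      // Point the connector at the Cassandra cluster.
      .config("spark.cassandra.connection.host", "cassandra-seed-host")
      .getOrCreate()

    // The connector splits the table by token range and knows which replicas hold
    // each split. Locality only helps if the analytics workers run on those
    // Cassandra nodes; with separate analytics nodes, every row crosses the network.
    val readings = spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "sensor_data")
      .option("table", "readings")
      .load()

    readings.groupBy("sensor_id").count().show()

    spark.stop()
  }
}
```

Whether the read is node-local or goes over the wire depends purely on where the workers run, which is exactly the a)/b) choice above.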

Related

Cassandra vs HDFS to store analytics data

We have an Apache Spark cluster that analyses data stored in HDFS (.parquet).
The solution is optimal in terms of performance, but it is not as disaster-safe as we would like: the HDFS architecture has a single point of failure (the NameNode), and even with two NameNodes you just have two points of failure, which is not enough for us.
To improve our cluster's fault tolerance, we would like to move to another data store solution such as Cassandra.
The questions are:
With Cassandra as the datastore, is Spark able to leverage data locality as it does with HDFS?
How would this change affect performance?
Thanks
Matteo
There is a slide deck about data locality with Spark and Cassandra, so yes, it is possible:
https://www.slideshare.net/SparkSummit/cassandra-and-spark-optimizing-russell-spitzer-1
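A minimal sketch of the mechanism those slides describe, assuming the DataStax spark-cassandra-connector is on the classpath; the keyspace and table names are invented:

```scala
// Sketch: each partition of a connector RDD corresponds to a slice of the Cassandra
// token ring, and its preferred locations are the replica nodes for that slice.
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds cassandraTable() to SparkContext

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-locality-check")
      .set("spark.cassandra.connection.host", "cassandra-seed-host")
    val sc = new SparkContext(conf)

    val rdd = sc.cassandraTable("analytics", "events")

    // Print which hosts each of the first few partitions would prefer to run on.
    rdd.partitions.take(3).foreach { p =>
      println(s"partition ${p.index} prefers hosts: ${rdd.preferredLocations(p).mkString(", ")}")
    }

    sc.stop()
  }
}
```

If the Spark executors run on the Cassandra nodes, these preferred locations let the scheduler place tasks node-local, much like it does with HDFS blocks.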
I haven't done any performance comparison of Spark on HDFS vs. Cassandra, and I believe it will vary between workflows; but since Netflix and Microsoft use Cassandra with Spark, I believe performance is acceptable in most cases, and it is probably a trade-off between data-ingestion speed, the presence or absence of an ETL step, and the speed of the analytical processing.
About Hadoop's single point of failure: if you run Cassandra with replication factor 3 and quorum consistency level, you can end up in the same place, because quorum needs 2 of the 3 replicas; if the two nodes that go down hold two replicas of the same partition, that data becomes unavailable :) Keep it in mind.
And maybe consider the MapR Hadoop distribution; they have tried to solve the NameNode problem.

What is the future of Apache HBase

Well, I wanted to know what the future of HBase is. As we know, it is used in real-time scenarios, but so are Cassandra and MongoDB. The only advantage HBase gets is that it comes packaged with the Cloudera/HDP distributions.
So how useful or effective is it to get deep into HBase?
Per the CAP theorem (Consistency, Availability and Partition tolerance), NoSQL databases can offer only 2 of the 3 guarantees. Depending on the use case, the following two groups are formed:
HBase and MongoDB are CP systems (consistency and partition tolerance).
Cassandra is AP (availability and partition tolerance).
HBase and Cassandra are column-oriented, whereas MongoDB is document-oriented.
HBase's architecture is master-worker, while Cassandra is masterless. Cassandra offers robust support for clusters spanning multiple datacenters.
A key functional difference is that HBase still offers lower-latency reads and writes, whereas Cassandra is geared towards high throughput.
https://en.wikipedia.org/wiki/Apache_HBase
https://en.wikipedia.org/wiki/Apache_Cassandra

Is HDFS necessary for Spark workloads?

HDFS is not necessary, but recommendations for it appear in some places.
To help evaluate the effort spent on getting HDFS running:
What are the benefits of using HDFS for Spark workloads?
Spark is a distributed processing engine and HDFS is a distributed storage system.
If HDFS is not an option, then Spark has to use some other alternative, such as Apache Cassandra or Amazon S3.
Have a look at this comparison (a short sketch after this list shows the same job pointed at each backend):
S3 – Non-urgent batch jobs. S3 fits very specific use cases, when data locality isn't critical.
Cassandra – Perfect for streaming data analysis, and overkill for batch jobs.
HDFS – Great fit for batch jobs without compromising on data locality.
When to use HDFS as the storage engine for Spark distributed processing?
If you already have a big Hadoop cluster in place and are looking for real-time analytics of your data, Spark can use the existing Hadoop cluster. It will reduce development time.
Spark is an in-memory computing engine. Since data can't always fit into memory, data has to be spilled to disk for some operations. Spark benefits from HDFS in this case. The sort-benchmark record achieved by Spark used HDFS storage for the sorting operation.
HDFS is a scalable, reliable and fault-tolerant distributed file system (since the Hadoop 2.x releases). With the data locality principle, processing speed is improved.
It is best for batch-processing jobs.
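Here is the sketch mentioned above: the same Spark code can be pointed at HDFS, S3 or a shared local path just by changing the URI scheme. The bucket and path names are placeholders, and the s3a:// read assumes the hadoop-aws module and credentials are configured.

```scala
// Sketch: one job, three interchangeable storage backends; paths are placeholders.
import org.apache.spark.sql.SparkSession

object StorageBackends {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("storage-backends-sketch").getOrCreate()

    // HDFS: splits come from HDFS blocks, so tasks can run on the datanodes (data locality).
    val fromHdfs  = spark.read.parquet("hdfs:///data/events/")

    // S3: no locality, every read goes over the network; fine for non-urgent batch jobs.
    val fromS3    = spark.read.parquet("s3a://my-bucket/data/events/")

    // Local/shared filesystem: only works if every worker can see the same path.
    val fromLocal = spark.read.parquet("file:///mnt/shared/data/events/")

    println(fromHdfs.count() + fromS3.count() + fromLocal.count())
    spark.stop()
  }
}
```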
The shortest answer is: "No, you don't need it." You can analyse data even without HDFS, but of course you then need to replicate the data across all your nodes.
The long answer is quite counterintuitive and I'm still trying to understand it with the help of the Stack Overflow community:
Spark local vs hdfs permormance
HDFS (or any distributed file system) makes distributing your data much simpler. Using a local filesystem, you would have to partition/copy the data by hand to the individual nodes and be aware of the data distribution when running your jobs. In addition, HDFS also handles node failures for you.
As for the integration between Spark and HDFS: Spark knows about the data distribution, so it will try to schedule tasks on the same nodes where the required data resides.
Second: which problems exactly did you face with the instructions?
BTW: if you are just looking for an easy setup on AWS, DC/OS allows you to install HDFS with a single command...
So you could go with the Cloudera or Hortonworks distro and load up an entire stack very easily. CDH will be used with YARN, though I find it much more difficult to configure Mesos in CDH. Hortonworks is much easier to customize.
HDFS is great because of datanodes = data locality (process where the data is), as shuffling/data transfer is very expensive. HDFS also naturally splits files into blocks, which allows Spark to partition on those blocks (128 MB blocks by default; you can change this).
You could use S3 and Redshift.
See here:
https://github.com/databricks/spark-redshift
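For completeness, a rough sketch of reading Redshift through the spark-redshift data source linked above; the JDBC URL, table, column name and S3 tempdir are placeholders, and S3/Redshift credentials still have to be configured separately.

```scala
// Sketch of the spark-redshift data source from the repository linked above.
// Redshift unloads the query result to the S3 tempdir, and Spark reads it from there.
import org.apache.spark.sql.SparkSession

object RedshiftReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("redshift-read-sketch").getOrCreate()

    val df = spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://redshift-host:5439/mydb?user=USER&password=PASS")
      .option("dbtable", "public.events")                    // placeholder table
      .option("tempdir", "s3a://my-bucket/spark-redshift-tmp/") // placeholder bucket
      .load()

    df.groupBy("event_type").count().show()                  // placeholder column
    spark.stop()
  }
}
```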

1 big Hadoop and Hbase cluster vs 1 Hadoop cluster + 1 Hbase cluster

Hadoop will run a lot of jobs that read data from HBase and write data back to HBase. Suppose I have 100 nodes; then there are two ways I can build my Hadoop/HBase cluster:
A 100-node Hadoop & HBase cluster (1 big Hadoop & HBase)
Separating the database (HBase), so that we have two clusters: a 60-node Hadoop cluster and a 40-node HBase cluster (1 Hadoop + 1 HBase)
Which option is better? Why?
Thanks.
I would say option 2 is better. My reasoning: even though your requirement is mostly about running lots of MapReduce jobs to read and write data out of HBase, there is a lot going on behind the scenes for HBase to optimise those reads and writes for your submitted jobs. The HMaster will have to do load balancing often, unless your region keys are perfectly balanced. Table hotspotting can occur. On the RegionServer side there will be major compactions, and if your JVM-tuning skills are not that good, the occasional stop-the-world garbage collection can happen. All the regions may start splitting at the same time. Your RegionServer can go down, and so on. The moot point is: tuning HBase needs time.
If the same nodes have to carry both HBase and your MapReduce load, the probability of the aforementioned problems is higher. It's always better to split them, so all the performance pressure doesn't land on the same set of nodes. And by the way, the scoring point of HBase is its inherently distributed nature, and you wouldn't want to kill that. All that said, you can experiment with the ratio of nodes between Hadoop and HBase, maybe 70:30 or 80:20. Mileage may vary according to your application requirements.
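On the hotspotting point: a common mitigation (not something the answer above prescribes, just a well-known technique) is to salt the row keys so that monotonically increasing keys are spread across regions. A tiny sketch in plain Scala, with no HBase API involved; the bucket count and key layout are arbitrary choices.

```scala
// Sketch: salt row keys so sequential keys (e.g. timestamps) don't all hit one region.
object RowKeySalting {
  val SaltBuckets = 16

  // Prefix the key with a stable salt derived from the key itself, e.g. "07|1700000000123".
  def saltedKey(rawKey: String): String = {
    val bucket = math.abs(rawKey.hashCode % SaltBuckets)
    f"$bucket%02d|$rawKey"
  }

  def main(args: Array[String]): Unit = {
    // Sequential keys no longer all land in the last region.
    Seq("1700000000123", "1700000000124", "1700000000125")
      .map(saltedKey)
      .foreach(println)
  }
}
```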
The main reason to separate HBase and Hadoop is when they have different usage scenarios, i.e. HBase serving random low-latency reads and writes while Hadoop runs sequential batches. In that case the different access patterns can interfere with each other, and it can be better to separate the clusters.
If you're just using HBase in batch mode, you can use the same cluster (and probably rethink using HBase, since it is slower than raw Hadoop for batch work).
Note that you would need to tune HBase along the lines mentioned by Chandra Kant regardless of the path you take.

MySQL Cluster vs. Hadoop for handling big data

I want to know the advantages/disadvantages of using a MySQL Cluster versus using the Hadoop framework.
Which is the better solution? I would like to read your opinions.
I think the advantages of using a MySQL Cluster are:
high availability
good scalability
high performance / real time data access
you can use commodity hardware
And I don't see a disadvantage! Are there any disadvantages that Hadoop does not have?
The advantages of Hadoop, with Hive on top of it, are:
also good scalability
you can also use commodity hardware
the ability to run in heterogeneous environments
parallel computing with the MapReduce framework
Hive with HiveQL
and the disadvantage is:
no real-time data access; it may take minutes or hours to analyze the data.
So in my opinion, for handling big data a MySQL Cluster is the better solution. Why is Hadoop the holy grail of handling big data? What is your opinion?
Both of the above answers miss a huge differentiator between MySQL and Hadoop. MySQL requires you to store data in a certain format. It likes heavily structured data: you declare the data type of each column in a table, and so on. Hadoop doesn't care about this at all.
Example: if you have a billion text log files, to make analysis even possible for MySQL you'd need to parse and load the data into a MySQL table first, typing each column along the way. With Hadoop and MapReduce, you define the function that scans/analyzes/returns the data from its raw source; you don't need a pre-processing ETL step to get it pre-structured.
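A minimal sketch of that schema-on-read idea, written with Spark for brevity rather than a raw MapReduce job; the input path and log-line format are invented for the example.

```scala
// Sketch: no table definition, no column types, no load step. The parsing logic
// lives in the job itself and runs directly against the raw files.
import org.apache.spark.sql.SparkSession

object RawLogScan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("raw-log-scan").getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.textFile("hdfs:///logs/app/*.log")

    val errorsPerHour = lines
      .filter(_.contains(" ERROR "))
      .map(line => (line.take(13), 1L))   // assumes lines start with "yyyy-MM-dd HH"
      .reduceByKey(_ + _)

    errorsPerHour.take(20).foreach(println)
    spark.stop()
  }
}
```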
If the data is already structured and in MySQL, then (hopefully) it's well structured, so why export it for Hadoop to analyze? If it isn't, why spend the time to ETL the data?
Hadoop is not a replacement for MySQL, so I think they each have their own scenarios.
Everyone knows Hadoop is better for batch jobs or offline computation, but there are also many related real-time products, such as HBase.
If you want to choose an offline compute & storage architecture,
I suggest Hadoop rather than a MySQL Cluster for offline compute & storage, because of:
Cost: obviously, a Hadoop cluster is cheaper than a MySQL Cluster
Scalability: Hadoop supports more than ten thousand machines in a cluster
Ecosystem: MapReduce, Hive, Pig, Sqoop and so on
So you can choose Hadoop for offline compute & storage and MySQL for online compute & storage; you can also learn more from the lambda architecture.
The other answer is good, but it doesn't really explain why Hadoop is more scalable for offline data crunching than a MySQL Cluster. Hadoop is more efficient for large data sets that must be distributed across many machines because it gives you full control over the sharding of the data.
MySQL Cluster uses auto-sharding, and it's designed to distribute the data randomly so that no single machine gets hit with more of the load. Hadoop, on the other hand, allows you to explicitly define the data partitioning, so that multiple data points requiring simultaneous access land on the same machine, minimizing the amount of inter-machine communication needed to get the job done. This makes Hadoop better for processing massive data sets in many cases.
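A small sketch of that "explicit partitioning" point, using a custom Spark Partitioner; the (customerId, eventId) key scheme is invented for the example, and the same kind of control exists in plain MapReduce via a custom Partitioner class.

```scala
// Sketch: route every record of the same customer to the same partition (and hence
// the same machine), so per-customer work needs no cross-node traffic.
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

class CustomerPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (customerId: String, _) => math.abs(customerId.hashCode % partitions)
    case other                   => math.abs(other.hashCode % partitions)
  }
}

object ExplicitSharding {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("explicit-sharding-sketch"))

    val events = sc.parallelize(Seq(
      (("cust-1", "e1"), 10.0),
      (("cust-1", "e2"), 4.5),
      (("cust-2", "e3"), 7.0)
    ))

    // All of cust-1's events end up in one partition.
    val partitioned = events.partitionBy(new CustomerPartitioner(8))

    partitioned
      .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
      .collect()
      .foreach { case (idx, n) => println(s"partition $idx holds $n events") }

    sc.stop()
  }
}
```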
The answer to this question has a good explanation of this distinction.
