cassandra and hadoop - realtime vs batch - hadoop

As per http://www.dbta.com/Articles/Columns/Notes-on-NoSQL/Cassandra-and-Hadoop---Strange-Bedfellows-or-a-Match-Made-in-Heaven-75890.aspx
Cassandra has pursued somewhat different solutions than has Hadoop. Cassandra excels at high-volume real-time transaction processing, while Hadoop excels at more batch-oriented analytical solutions.
What are the differences in the architecture/implementation of Cassandra and Hadoop that account for this sort of difference in usage? (In lay software-professional terms.)

I wanted to add this because I think there is a misleading suggestion here that Cassandra performs well for reads.
Cassandra is not very good at random reads either. It is good compared to other solutions out there in how it lets you read randomly over a huge amount of data, but at some point, if the reads are truly random, you cannot avoid hitting the disk every single time, which is expensive; depending on your cluster it may come down to something as low as a few thousand hits per second. So planning on doing lots of random queries may not be the best idea; you will run into a wall if you start thinking like that. I would say everything in big data works better when you do sequential reads or find a way to store the data sequentially. In most cases, even when you do real-time processing, you still want to find a way to batch your queries.
This is why you need to think beforehand about what you store under a key and try to get the most information possible out of a single read.
It is also somewhat funny that the statement uses "transaction" and Cassandra in the same sentence, because transactions are not really something that happens there.
On the other hand, Hadoop is meant to be batch almost by definition, but Hadoop is a distributed MapReduce framework, not a database. In fact, I have seen and used a lot of Hadoop over Cassandra; they are not antagonistic technologies.
Handling your big data in real time is doable, but it requires good thinking and care about when and how you hit the database.
Edit: removed the secondary-indices example, as last time I checked those used random reads (though I have been away from Cassandra for more than a year now).

Vanilla Hadoop consists of a distributed file system (DFS) at the core and libraries that support the MapReduce model for writing analysis programs. The DFS is what makes Hadoop scalable: it takes care of chunking data across the nodes of a multi-node cluster so that MapReduce can work on the individual chunks on the nodes that hold them, thus enabling parallelism.
The paper for Google File System which was the basis for Hadoop Distributed File System (HDFS) can be found here
The paper for Map Reduce model can be found here
For a detailed explanation on Map Reduce read this post
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. It is not a conventional database but is more like a Hashtable or HashMap that stores key/value pairs. Note that Cassandra does not run on top of HDFS; it has its own storage engine. Both Cassandra and HBase borrow heavily from Google's BigTable design (HBase is the one that actually runs on HDFS). The paper for Google BigTable can be found here.
BigTable makes use of a Sorted String Table (SSTable) to store key/value pairs. An SSTable is just a file (in a distributed file system for BigTable/HBase, on local disk for Cassandra) that stores keys followed by values. Furthermore, BigTable maintains an index that maps each key to its offset in the file, which makes it possible to read the value for a key with a single seek to that offset. An SSTable is effectively immutable, which means that after the file is created no modifications can be made to existing key/value pairs; new key/value pairs are appended to the file. Updates and deletes of records are also appends: an update appends a newer key/value pair, and a deletion appends the key with a tombstone value. Duplicate keys are therefore allowed in this file. The index is also updated whenever an update or delete takes place, so that the offset for a key points to the latest value or the tombstone.
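To make the append-plus-index idea concrete, here is a minimal, hypothetical sketch in Python (my own toy code, not Cassandra's or BigTable's actual implementation) of an append-only key/value file with an in-memory offset index and tombstone deletes:

```python
import os

TOMBSTONE = "__tombstone__"  # illustrative sentinel marking a deleted key

class TinySSTable:
    """Toy append-only key/value file with an in-memory offset index.

    Real SSTables are sorted, immutable once written, and carry block
    indexes and bloom filters; this sketch only shows the append + seek idea.
    """

    def __init__(self, path):
        self.path = path
        self.index = {}               # key -> byte offset of its latest entry
        open(path, "ab").close()      # make sure the file exists

    def _append(self, key, value):
        offset = os.path.getsize(self.path)               # entries are only ever appended
        with open(self.path, "ab") as f:
            f.write(f"{key}\t{value}\n".encode("utf-8"))  # duplicate keys allowed
        self.index[key] = offset                          # index points at the newest entry

    def put(self, key, value):
        self._append(key, value)                          # update = append a newer value

    def delete(self, key):
        self._append(key, TOMBSTONE)                      # delete = append a tombstone

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])                       # one seek straight to the latest entry
            _, value = f.readline().rstrip(b"\n").split(b"\t", 1)
        value = value.decode("utf-8")
        return None if value == TOMBSTONE else value


store = TinySSTable("toy.sstable")
store.put("user:1", "alice")
store.put("user:1", "alice v2")    # newer value appended, index updated
store.delete("user:1")             # tombstone appended
print(store.get("user:1"))         # None
```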
Thus you can see that Cassandra's internals allow fast reads and writes, which is crucial for real-time data handling, whereas vanilla Hadoop with MapReduce is suited to processing batch-oriented, passive data.

Hadoop consists of two fundamental components: a distributed datastore (HDFS) and a distributed computation framework (MapReduce). It reads a bunch of input data from the datastore and writes its output back to it. It needs a distributed datastore because it performs parallel computation on data stored locally on a cluster of machines, which minimizes data-loading time.
Cassandra, on the other hand, is a datastore with linear scalability and fault tolerance. It lacks the parallel computation ability that MapReduce provides in Hadoop.
Hadoop's default datastore (HDFS) can be replaced with other storage backends, such as Cassandra, GlusterFS, Ceph, Amazon S3, Microsoft Azure's file system, MapR's FS, etc. However, each alternative has its pros and cons, and they should be evaluated based on your needs.
There are some resources that help you integrate Hadoop with Cassandra: http://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configHadoop.html

Related

Spark performance advantage vs. Hadoop MapReduce [duplicate]

I am hearing that Spark has an advantage over Hadoop due to Spark's in-memory computation. However, one obvious problem is that not all the data can fit into one computer's memory. So is Spark then limited to smaller datasets? At the same time, there is the notion of a Spark cluster. So I am not following the purported advantages of Spark over Hadoop MR.
Thanks
Hadoop MapReduce has been the mainstay on Hadoop for batch jobs for a long time. However, two very promising technologies have emerged: Apache Drill, a low-latency SQL engine for self-service data exploration, and Apache Spark, a general-purpose compute engine that lets you run batch, interactive, and streaming jobs on the cluster using the same unified framework. Let's dig a little bit more into Spark.
To understand Spark, you really have to understand three big concepts.
The first is RDDs, resilient distributed datasets. An RDD is a representation of the data coming into your system in an object format that allows you to do computations on top of it. RDDs are resilient because they carry their lineage: whenever there is a failure in the system, they can recompute themselves from their parent data using that lineage information.
The second concept is transformations. Transformations are what you apply to RDDs to get other RDDs. Examples of transformations are things like reading a file into an RDD, or applying a function such as filter, which then produces another RDD.
The third and final concept is actions. These are the operations where you actually ask the system for an answer it needs to provide, for instance count, or asking for the first line that has "Spark" in it. The interesting thing with Spark is that it does lazy evaluation, which means that these RDDs are not loaded and pushed into the system as soon as the system encounters them; the work is only done when there is actually an action to be performed.
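As a minimal sketch (assuming a local PySpark installation and an input file I am calling input.txt), the transformations below only build up the RDD lineage without touching the data; the actions at the end are what trigger the actual work:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-eval-demo")

lines = sc.textFile("input.txt")                      # transformation: nothing is read yet
spark_lines = lines.filter(lambda l: "Spark" in l)    # transformation: still nothing runs

# Only these actions trigger reading and filtering the file.
print(spark_lines.count())    # how many lines mention Spark
print(spark_lines.first())    # the first line that has "Spark" in it
```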
One question that comes up with RDDs, given that they are resilient and held in main memory, is how they compare with the distributed shared memory architectures most of us are familiar with from the past. There are a few differences; let's go through them briefly. First, writes: writes to an RDD happen at the level of the whole RDD (a transformation applies to the entire dataset), so they are coarse-grained, whereas writes in distributed shared memory are typically fine-grained. Reads in distributed shared memory are fine-grained as well, while reads from an RDD can be fine- or coarse-grained.
The second piece is recovery. What happens if a part of the system fails, and how do we recover? Since RDDs build a lineage graph, if something goes bad they can go back, recompute based on that graph, and regenerate the RDD; lineage is used very heavily in RDDs for recovery. In distributed shared memory we typically fall back on checkpointing done at intervals, or some other semantic checkpointing mechanism. Consistency is relatively trivial in RDDs because the data underneath them is assumed to be immutable; if the data were changing, consistency would be a problem here. Distributed shared memory makes no assumptions about mutability and therefore leaves the consistency semantics to the application.
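To see the lineage that this recovery relies on, you can ask an RDD to describe its chain of parents. A small sketch (again assuming an input.txt, and allowing for the fact that PySpark versions differ on whether toDebugString returns bytes or str):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()   # reuse an existing context if one is already running

counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

# toDebugString shows the chain of parent RDDs (the lineage graph) that Spark
# would replay to rebuild a lost partition of `counts`.
lineage = counts.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)
```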
Finally, let's look at the benefits of Spark:
Spark provides full recovery using lineage.
Spark optimizes the computations, and where they are placed, using the directed acyclic graph (DAG) of the job.
Very easy programming paradigms using transformations and actions on RDDs, as well as rich library support for machine learning, graph processing, and, more recently, data frames.
At this point a question comes up: if Spark is so great, does Spark actually replace Hadoop? The answer is clearly no, because Spark provides an application framework for writing your big data applications; it still needs to run on top of a storage system or a NoSQL system.
Spark is not limited to smaller datasets, and it is not always about in-memory computation. Spark has a good number of higher-level APIs, and it can process data at GB scale and beyond. In my real-world experience I have used Spark to handle streaming applications where we usually get data on the order of GB per hour, and we have used Spark in telecommunications to handle bigger datasets as well. Check RDD persistence for how to accommodate bigger datasets.
Real-world problems usually cannot be solved with a single MapReduce program consisting of one Mapper class and one Reducer class; we mostly need to build a pipeline. A pipeline consists of multiple stages, each of which is a MapReduce program, and the output of one stage is fed one or more times into subsequent stages. This is painful because of the amount of I/O it involves.
In MapReduce there are Map and Reduce tasks, after which there is a synchronization barrier and the data has to be persisted to disk. This feature of the MapReduce framework was developed so that jobs can be recovered in case of failure, but the drawback is that it does not leverage the memory of the Hadoop cluster to the maximum. This becomes worse when you have an iterative algorithm in your pipeline: every iteration causes a significant amount of disk I/O.
To solve this problem, Spark introduced a new data structure called the RDD, a structure that holds the information about how the data can be read from disk and what to compute, rather than the data itself. Spark also provides an easy programming paradigm for creating a pipeline (a DAG) by transforming RDDs. What you get is a series of RDDs that know how to get the data and what to compute.
Finally, when an action is invoked, the Spark framework internally optimizes the pipeline, groups together the portions that can be executed together (the map phases), and creates a final optimized execution plan from the logical pipeline, then executes it. It also gives the user the flexibility to select which data should be cached. Hence Spark is able to achieve roughly 10 to 100 times faster batch processing than MapReduce.
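As a hedged sketch of that caching flexibility (the file name and the iteration itself are invented for illustration), persisting an RDD that an iterative job reuses avoids re-reading and re-parsing the input on every pass:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Parse the input once and keep the result in memory across iterations.
points = (sc.textFile("points.txt")
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())                 # without cache(), every iteration re-reads the file

total = 0.0
for _ in range(10):                   # stand-in for an iterative algorithm
    # Each pass reuses the cached partitions instead of going back to disk.
    total = points.map(sum).reduce(lambda a, b: a + b)

print(total)
```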
Spark advantages over Hadoop:
Because Spark tasks across stages can be executed on the same executor nodes, the time to spawn executors is saved across multiple tasks.
Even if you have huge memory, MapReduce can never take advantage of caching data in memory and reusing that in-memory data for subsequent steps.
Spark, on the other hand, can cache data if a large enough JVM heap is available to it; the in-memory data is reused across stages.
In Spark, tasks run as threads on the same executor, making each task's memory footprint light.
In MapReduce, the Map and Reduce tasks are processes, not threads.
Spark uses an efficient serialization format to store data on disk.
Follow this for a detailed understanding: http://bytepadding.com/big-data/spark/understanding-spark-through-map-reduce/

Purpose of Hadoop MapReduce

Currently I am reading some papers about Hadoop and the popular MapReduce algorithm. However, I cannot see the value of MapReduce and would be glad if someone could give some insight about it. Specifically:
It is said that MapReduce receives a file and produces key-value pairs. What is a key? Just a word, a combination of words, or something else? If the key is the words in the file, then what is the purpose of writing code for MapReduce? MapReduce should do the same thing without implementing a specific algorithm.
If everything is converted to key-value pairs, then what Hadoop does is just create a Dictionary like in Java and C#, right? Maybe Hadoop can create the dictionary in a more efficient way. Other than efficiency, what does Hadoop provide that a normal Dictionary object cannot?
What do I gain by converting a file to key-value pairs? I know I can find the counts and frequencies of the words, but for what? What could be the purpose of counting the number of words?
It is said that Hadoop can be used for unstructured data. If everything is converted to a key-value pair, then it is only natural that Hadoop can work with unstructured data! I can write a program in C# to generate the key-value pairs instead of using Hadoop. What is the real value of Hadoop that I cannot get from other kinds of programming tools?
The questions may seem correlated with each other, but I believe they convey the idea behind my question. I would be glad if you could answer the questions above.
Regards,
Edit:
Hi Guys,
Thank you very much for your responses.
From what I understood from your answers and from playing with Hadoop a little, I would like to state my conclusions in a very high-level, basic way:
Hadoop processes data through key-value pairs. Everything is converted to key-value pairs.
The main attention should be given to the definitions of the key and the value, which may change according to business needs.
Hadoop provides just an efficient (i.e. distributed, scalable, and able to handle huge amounts of data) implementation of a dictionary, nothing more.
Any comments on these conclusions are welcome.
As a final note I would like to add that, for a simple map-reduce implementation, I believe there should be a user interface that lets the user select/define the keys and the appropriate values. This UI could also be extended for further statistical analysis.
Regards,
It is said that MapReduce receives a file and produces key-value pairs. What is a key? Just a word, a combination of words, or something else? If the key is the words in the file, then what is the purpose of writing code for MapReduce? MapReduce should do the same thing without implementing a specific algorithm.
MapReduce should be viewed as a distributed computing framework. For the word count example the key is a word, but we can have anything as the key (APIs are available for common key types, and we can write custom ones as well). The purpose of having the key is to partition, sort, and merge the sorted data in order to perform aggregations. The map phase is used to perform row-level transformations, filtering, etc., and the reduce phase takes care of aggregation. Map and Reduce need to be implemented; the shuffle phase, which typically comes out of the box, takes care of partitioning, shuffling, sorting, and merging.
If everything is converted to key-value pairs, then what Hadoop does is just create a Dictionary like in Java and C#, right? Maybe Hadoop can create the dictionary in a more efficient way. Other than efficiency, what does Hadoop provide that a normal Dictionary object cannot?
Covered as part of previous question.
What do I gain by converting a file to key-value pairs? I know I can find the counts and frequencies of the words, but for what? What could be the purpose of counting the number of words?
You can perform transformations, filtering, aggregations, joins, and any custom task that can be performed on unstructured data. The major difference is that it is distributed, hence it can scale better than any legacy solution.
It is said that Hadoop can be used for unstructured data. If everything is converted to a key-value pair, then it is only natural that Hadoop can work with unstructured data! I can write a program in C# to generate the key-value pairs instead of using Hadoop. What is the real value of Hadoop that I cannot get from other kinds of programming tools?
The key can be the line offset, and then you can process each record. It does not matter whether every record has the same structure or a different one.
Here are the advantages of using Hadoop:
Distributed file system (HDFS)
Distributed processing framework (map reduce)
Data locality (in typical modern applications, files are network-mounted, so the data, which is bigger than the code, has to be copied to the servers where the code is deployed; in Hadoop, the code goes to the data, and the Hadoop success stories do not rely on a network file system)
Limited usage of network while storing and processing very large data sets
Cost effective (open source software on commodity hardware)
and many more.
Take the word count example to get a better understanding.
What is a key? Just a word, a combination of words or something else?
For Mapper:
The key is the offset of the line from the beginning of the file; the value is the entire line. Once the line is read from the file, it is split into multiple key-value pairs for the Reducer. A delimiter such as a tab, a space, or characters like "," or ":" helps split the line into key-value pairs.
For Reducer:
The key is an individual word; the value is the number of occurrences of the word.
Once you get the key-value pairs at the reducer, you can run many aggregations/summarizations/categorizations of the data and provide an analytical summary of it.
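As a sketch of this flow using Hadoop Streaming, so the mapper and reducer can be plain Python scripts (the file names mapper.py and reducer.py are my own), the mapper emits a (word, 1) pair per word and the reducer sums the counts, relying on the framework's shuffle to sort the pairs by key in between:

```python
#!/usr/bin/env python3
# mapper.py: emit one "word<TAB>1" pair per word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: input arrives sorted by key, so all counts for a word are adjacent.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These would be submitted with the hadoop-streaming jar (its exact path varies by distribution), passing mapper.py and reducer.py as the -mapper and -reducer options; you can also dry-run the pair locally with `cat input.txt | python3 mapper.py | sort | python3 reducer.py`.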
Have a look at this use-case article, which covers finance, energy, telecom, retail, etc.
Have a look at this article for a better understanding of the entire word count example, and at the MapReduce tutorial.
What is the purpose of writing code for MapReduce? MapReduce should do the same thing without implementing a specific algorithm.
Hadoop has four key components:
1. Hadoop Common: The common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
3. Hadoop YARN: A framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Maybe Hadoop can create the dictionary in a more efficient way. Other than efficiency, what does Hadoop provide that a normal Dictionary object cannot?
Creating the dictionary is not the core purpose. Hadoop creates these key-value pairs and then uses them to solve business use cases, depending on the requirement.
The word count example may produce output that is just a word and its count, but you can process structured, semi-structured, and unstructured data for various use cases:
Find the hottest year/month/day/hour for any given place in the world.
Find the number of buy/sell transactions of a particular stock on the NYSE on a given day. Provide minute-wise/hour-wise/day-wise summaries of transactions per stock. Find the top 10 most heavily traded stocks on a given day.
Find the number of tweets/retweets for a particular hashtag.
What may be the purpose of counting the number of words?
The purpose was explained in the earlier answers.
I can write a program in C# to generate the key-value pairs instead of using Hadoop. What is the real value of Hadoop that I cannot get from other kinds of programming tools?
How much data volume can you handle by writing C# to get key-value pairs and process the data? Can you process 10 petabytes of weather information on a 5000-node cluster using C#, with a distributed storage/processing framework developed in C#?
How do you summarize the data or find the top 10 cool/hot places using C#?
You would have to develop some framework to do all of these things, and Hadoop has already come up with that framework.
HDFS is used for distributed storage of data in volumes of petabytes. If you need to handle data growth, just add more nodes to the Hadoop cluster.
Hadoop MapReduce and YARN provide the framework for distributed data processing, to process data stored on thousands of machines in a Hadoop cluster.
Image source: kickstarthadoop ( article author: Bejoy KS)

Map Reduce & RDBMS

I was reading the Hadoop definitive guide. It says MapReduce is good for updating the larger portion of a database, and that it uses sort & merge to rebuild the database, which is limited by transfer time.
It also says an RDBMS is good for updating only a smaller portion of a big database; it uses a B-Tree, which is limited by seek time.
Can anyone elaborate on what both of these claims really mean?
I am not really sure what the book means, but you will usually run a MapReduce job to rebuild the entire database/anything if you still have the raw data.
The really good thing about Hadoop is that it is distributed, so performance is not really a problem since you can just add more machines.
Let's take an example: you need to rebuild a complex table with 1 billion rows. With an RDBMS, you can only scale vertically, so you will depend more on the power of the CPU and how fast the algorithm is. You will do it with some SQL commands: select some data, process it, and so on. So you will most likely be limited by seek time.
With Hadoop MapReduce, you can just add more machines, so performance is not the problem. Say you use 10,000 mappers: the task will be divided across 10,000 mapper containers, and because of Hadoop's nature, these containers usually already have their share of the data stored locally on their hard drives. The output of each mapper is always key-value structured data on its local hard drive, sorted by key by the mapper.
Now the problem is that the data needs to be combined, so all of it is sent to a reducer. This happens over the network and is usually the slowest part if you have big data. The reducer receives all of the data and merge-sorts it for further processing. In the end you have a file that can simply be loaded into your database.
The transfer from mappers to reducers is usually what takes the longest if you have a lot of data, and the network is usually your bottleneck. Maybe this is what the book means by being dependent on transfer time.
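As a toy, single-machine simulation of that sort-and-merge step (not Hadoop code; the data here is invented), each mapper's output is already a run sorted by key, so the reducer only has to do a cheap streaming merge before aggregating:

```python
import heapq

# Each "mapper" produced a run of (key, count) pairs already sorted by key.
mapper_outputs = [
    [("apple", 1), ("banana", 2), ("cherry", 1)],
    [("apple", 3), ("cherry", 2), ("date", 1)],
]

# The "reducer" merge-sorts the runs (cheap, since each run is already sorted)
# and then aggregates the values per key.
totals = {}
for key, count in heapq.merge(*mapper_outputs):
    totals[key] = totals.get(key, 0) + count

print(totals)   # {'apple': 4, 'banana': 2, 'cherry': 3, 'date': 1}
```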

Is Hadoop the right tech for this?

If I had millions of records of data that are constantly being updated and added to every day, and I needed to comb through all of the data for records that match specific logic and then take that matching subset and insert it into a separate database, would I use Hadoop and MapReduce for such a task, or is there some other technology I am missing? The main reason I am looking for something other than a standard RDBMS is that all of the base data comes from multiple sources and is not uniformly structured.
Map-Reduce is designed for algorithms that can be parallelized, where local results can be computed and then aggregated. A typical example is counting words in a document: you can split the work into multiple parts, count some of the words on one node, some on another node, and so on, and then add up the totals (obviously this is a trivial example, but it illustrates the type of problem).
Hadoop is designed for processing large data files (such as log files). The default block size is 64MB, so having millions of small records wouldn't really be a good fit for Hadoop.
To deal with the issue of non-uniformly structured data, you might consider a NoSQL database, which is designed to handle data where a lot of the columns are null (such as MongoDB).
Hadoop/MR is designed for batch processing and not for real-time processing. So some other alternative, like Twitter Storm or HStreaming, has to be considered.
Also, look at Hama for real-time processing of data. Note that real-time processing in Hama is still crude and a lot of improvement/work remains to be done.
I would recommend Storm or Flume. In either of these you can analyze each record as it comes in and decide what to do with it.
If your data volumes are not great, and millions of records do not sound like much, I would suggest trying to get the most out of an RDBMS, even if your schema will not be properly normalized.
I think even a table with a structure like K1, K2, K3, Blob would be more useful.
In NoSQL, key-value stores are built to support schemaless data in various flavors, but their query capabilities are limited.
The only case I can think of as useful is MongoDB/CouchDB's capability to index schemaless data; you would be able to get records by some attribute value.
Regarding Hadoop MapReduce, I think it is not useful unless you want to harness a lot of CPUs for your processing, have a lot of data, or need distributed sort capability.

Difference between MapReduce of data store like couchdb and that of Hadoop?

Recently, on a webinar by Couchbase, they said that Hadoop should be used for processing large log files and Couchbase for presenting the results to the application layer. They claimed that the map and reduce of Couchbase and Hadoop are different and suited to the respective use cases mentioned.
I was going to use Couchbase map reduce for processing a large amount of log files.
Can some one please clarify the exact difference between the two map reduce? Are there any features in Hadoop which makes it more suitable for processing large log files?
Thanks...
The main difference is the fact that Couchbase uses incremental map/reduce and won't re-scan the whole data set once you need to update or remove items. Another difference is the magnitude of "large": if you need to process hundreds of gigabytes of logs at once, then Couchbase isn't the best choice.
Couchbase is one of many NoSQL data storage applications. Data is stored in Key / Value pairs, with the keys indexed for quick retrieval.
Conversely, data in Hadoop is not indexed (other than by file name), and pulling a specific value from a file in HDFS is much slower, possibly involving scanning many files.
You would typically use something like Hadoop mapreduce to process large files, and update / populate a NoSQL store (such as Couchbase).
Using a NoSQL datastore for processing large amounts of data will most probably be less efficient than using MapReduce for the same job. But the NoSQL datastore will be able to serve a web layer considerably more efficiently than a MapReduce job (which can take tens of seconds to initialize, and minutes or hours to run).

Resources