What are Riak's advantages over Redis as a key-value store?

I have come across Riak being used as a key-value store in major corporations. I'm wondering what its distinguishing features are compared to a popular key-value store like Redis?

The biggest difference is the "typical usage" of each. Redis is typically used as a high-speed in-memory cache for applications, whereas Riak is typically used as a scalable, highly available, persistent data store / database. Each has its strengths and weaknesses, but comparing them is a bit like comparing apples and oranges.
Redis focuses on speed and rich built-in data structures; it does offer the ability to cluster instances as master/slave and to shard data, but only with extra configuration.
Riak focuses on easy scalability and data safety; it handles sharding and replication internally and hides the consistency model from most developers, but at the cost of higher latency, since it writes to disk instead of to memory.
In the end it depends on what your engineering needs are.
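To make the "typical usage" contrast concrete, here is a minimal sketch of the same GET/SET against both stores, assuming the redis-py and Riak Python client packages and default local endpoints; the hostnames, bucket, and key names are purely illustrative.

```python
# Minimal sketch: the same GET/SET against Redis and Riak.
# Assumes the `redis` (redis-py) and `riak` client packages and local servers.
import redis
import riak

# Redis: in-memory, single round trip, no replication decisions exposed here.
r = redis.Redis(host="localhost", port=6379)
r.set("user:42", "alice")
print(r.get("user:42"))            # b'alice'

# Riak: durable, replicated object store; data is addressed by bucket + key,
# and the cluster handles sharding/replication behind the scenes.
client = riak.RiakClient(host="localhost", pb_port=8087)
bucket = client.bucket("users")
bucket.new("42", data={"name": "alice"}).store()
print(bucket.get("42").data)       # {'name': 'alice'}
```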
Adron Hall has a good in-depth writeup here.
Disclosure: I work for Basho.

Related

What is the difference between a Big Data Warehouse and a traditional Data Warehouse

Usually, data warehouses in the context of big data are managed and implemented on the basis of Hadoop-based systems, like Apache Hive (right?).
On the other hand, my question concerns the methodological process.
How does big data affect the design process of a data warehouse?
Is the process similar, or must new tasks be considered?
Hadoop is similar in architecture to MPP data warehouses, but with some significant differences. Instead of being rigidly defined by a parallel architecture, processors are loosely coupled across a Hadoop cluster and each can work on different data sources.
The data manipulation engine, data catalog, and storage engine can work independently of each other with Hadoop serving as a collection point. Also critical is that Hadoop can easily accommodate both structured and unstructured data. This makes it an ideal environment for iterative inquiry. Instead of having to define analytics outputs according to narrow constructs defined by the schema, business users can experiment to find what queries matter to them most. Relevant data can then be extracted and loaded into a data warehouse for fast queries.
The Hadoop ecosystem starts from the same aim of wanting to collect together as much interesting data as possible from different systems, but approaches it in a radically better way. With this approach, you dump all data of interest into a big data store (usually HDFS – Hadoop Distributed File System). This is often in cloud storage – cloud storage is good for the task, because it’s cheap and flexible, and because it puts the data close to cheap cloud computing power. You can still then do ETL and create a data warehouse using tools like Hive if you want, but more importantly you also still have all of the raw data available so you can also define new questions and do complex analyses over all of the raw historical data if you wish. The Hadoop toolset allows great flexibility and power of analysis, since it does big computation by splitting a task over large numbers of cheap commodity machines, letting you perform much more powerful, speculative, and rapid analyses than is possible in a traditional warehouse.
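As a hedged illustration of that "dump the raw data first, ask questions later" flow, the following PySpark sketch reads raw event files straight from the data lake and answers a brand-new question without any upfront warehouse schema; the paths and column names are invented for the example.

```python
# Sketch only: query raw files in HDFS/cloud storage with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-data-exploration").getOrCreate()

# Load raw, never-modelled event data directly from the data lake.
events = spark.read.json("hdfs:///datalake/raw/clickstream/2024/*/*.json")

# Ask a brand-new question over all history; no upfront star schema required.
events.createOrReplaceTempView("events")
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM events
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()

# Optionally persist a curated extract for the downstream warehouse/marts.
top_pages.write.mode("overwrite").parquet("hdfs:///warehouse/curated/top_pages")
```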

Is Cassandra for OLAP or OLTP or both?

Cassandra does not comply with ACID like an RDBMS; it is described in terms of CAP. Cassandra picks AP out of CAP and leaves consistency tuning to the user.
I definitely cannot use Cassandra for core banking transactions because C* is not fully consistent.
But Cassandra writes are extremely fast, which is good for OLTP.
I can use C* for OLAP because reads are extremely fast, which is good for reporting too.
So my understanding is that C* is good only when your application can tolerate the data being inconsistent for some amount of time, but reads and writes need to be quick?
If my understanding is right, could you kindly list some example applications?
ACID is a set of properties of relational databases, whereas BASE describes most NoSQL databases, and Cassandra is one of them. The CAP theorem just describes the trade-off between consistency, availability and partition tolerance in distributed systems. The good thing about Cassandra is that it has tunable consistency, so you can be pretty much consistent (at the price of availability and latency), which makes OLTP doable. As phact said, there are even some banks that built their transaction software on top of Cassandra. OLAP is also doable, but not with Cassandra alone, since its partitioned row storage limits its query capabilities; you need something like Spark on top to run the complex queries required.
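To show what tunable consistency looks like in practice, here is a minimal sketch using the DataStax Python driver, raising the consistency level only for the query that needs stronger guarantees; the keyspace, table, and contact points are illustrative assumptions.

```python
# Sketch: per-query tunable consistency with the DataStax Python driver.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")

# Fast, availability-friendly write: only one replica must acknowledge.
fast_write = SimpleStatement(
    "INSERT INTO orders (id, status) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(fast_write, (42, "created"))

# Stronger read for a critical check: a majority of replicas must agree.
strong_read = SimpleStatement(
    "SELECT status FROM orders WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(strong_read, (42,)).one()
print(row.status)
```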
Cassandra should generally be avoided for OLTP applications; even the Cassandra folks state that it might not be the perfect fit for OLTP. Even though you can achieve a fully consistent model by setting the write consistency level to ALL, this makes writes rather expensive, because the coordinator node has to wait for the write to reach every replica. And if your Cassandra system is massively replicated across different data centers, maybe across different continents, then the time taken to write will increase dramatically.

Hadoop comparison to RDBMS

I really do not understand the actual reason behind Hadoop scaling better than an RDBMS. Can anyone please explain at a granular level? Does this have something to do with the underlying data structures and algorithms?
RDBMSs have challenges handling huge data volumes in the terabyte and petabyte range. Even with a Redundant Array of Independent/Inexpensive Disks (RAID) and data sharding, they do not scale well for huge volumes of data, and you require very expensive hardware.
EDIT:
To answer why an RDBMS cannot scale, have a look at the overheads of an RDBMS:
Logging. Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).
Locking. Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.
Latching. In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.
Buffer management. A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.
How does Hadoop handle this?
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment and can run on commodity hardware. It is useful for storing and retrieving huge volumes of data.
This scalability and efficiency are possible because of Hadoop's storage layer (HDFS) and its processing layer (MapReduce jobs on YARN). Apart from scalability, Hadoop provides high availability of the stored data.
Scalability, high availability, and flexible processing of huge volumes of data (structured, unstructured, and semi-structured) are key to the success of Hadoop.
Data is stored on thousands of nodes, and processing is done on the node where the data is stored (most of the time) through MapReduce jobs. Data locality on the processing front is one key reason for Hadoop's success.
This is achieved with the NameNode, DataNodes and the ResourceManager.
To understand how Hadoop achieves this, you should visit these links: HDFS Architecture, YARN Architecture and HDFS Federation.
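As a small illustration of the MapReduce and data-locality point above, here is a hedged Hadoop Streaming word-count sketch in Python: the mapper runs on the nodes that hold each HDFS block and emits key/value pairs, and the reducer aggregates the sorted output. The file name and the streaming invocation in the comment are illustrative.

```python
# wordcount.py -- sketch of a Hadoop Streaming job (mapper + reducer in one file).
# Illustrative invocation:
#   hadoop jar hadoop-streaming.jar -input /data/text -output /data/wc \
#     -mapper "python wordcount.py map" -reducer "python wordcount.py reduce"
import sys

def mapper():
    # Runs on the node holding the input split: emit (word, 1) per word.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key; sum the counts per word.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```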
Still, an RDBMS is good for repeated writes/reads/updates and consistent ACID transactions on gigabytes of data, but not for processing terabytes and petabytes of data. NoSQL, which offers two of the consistency, availability and partition-tolerance attributes of the CAP theorem, is good in some use cases.
But Hadoop is not meant for real-time transaction support with ACID properties. It is good for business-intelligence reporting with batch processing, following a "write once, read many" paradigm.
Have a look at one more related SE question:
NoSql vs Relational database
First, Hadoop IS NOT a DB replacement.
An RDBMS scales vertically and Hadoop scales horizontally.
This means that to double the capacity of an RDBMS you need hardware with double the memory, double the storage and double the CPU. That is very expensive and has limits: there isn't a server with 10 TB of RAM, for example. With Hadoop it is different; you don't need expensive cutting-edge hardware. Instead, you can use several commodity servers working together to simulate a bigger server (with some limitations). You can have a cluster with 10 TB of RAM distributed across several nodes.
Another advantage is that instead of having to buy a new, more powerful server and drop the old one, scaling a distributed system only requires adding new nodes to the cluster.
The one issue I have with the description above is the claim that parallel RDBMSs require expensive hardware. Teradata and Netezza need special hardware, but Greenplum and Vertica can be put on commodity hardware. (Now I will admit I am biased, like everyone else.) I have seen Greenplum scan petabytes of information daily. (Walmart was up to 2.5 petabytes last I heard.) I have dealt with both Hawq and Impala; they both require about 30% more hardware to do the same job on structured data. HBase is less efficient.
There is no magic silver bullet. It has been my experience that both structured and unstructured data have their place. Hadoop is great for ingesting large amounts of data and scanning through it a small number of times. We use it as part of our load procedures. An RDBMS is great at scanning the same data over and over with highly complex queries.
You always have to structure the data to make use of it, and that structuring takes time somewhere: you either structure it before you put it into an RDBMS, or at query time.
In an RDBMS, the data is structured and, moreover, indexed.
Retrieving any particular column means reading whole rows and then selecting that column from them.
Whereas in Hadoop, say with Hive over a columnar format, we read only that particular column from the entire data set.
Moreover, the data loading is done by MapReduce programs running in a distributed fashion, which reduces the overall time.
Hence, these are two advantages of using Hadoop and its tools.
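A hedged sketch of that column-pruning behaviour, assuming the data has been written in a columnar format such as Parquet; the path and column name are invented for the example.

```python
# Sketch: columnar formats let the engine read just the requested column.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("column-pruning").getOrCreate()

# Only the bytes for the `amount` column are scanned from the Parquet files;
# a row-oriented store would have to read whole rows to answer the same query.
sales = spark.read.parquet("hdfs:///warehouse/sales")
sales.select("amount").groupBy().sum("amount").show()
```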

Handling Big Data in a Datawarehouse [closed]

I am a learner in big data concepts. Based on my understanding, big data is critical for handling unstructured data and high volumes. When we look at the big data architecture for a data warehouse (DW), the data from the source is extracted through Hadoop (HDFS and MapReduce), the relevant unstructured information is converted into valid business information, and finally the data is injected into the DW or data mart through ETL processing (along with the existing structured data processing).
However, I would like to know what new techniques, dimensional models, or storage requirements are needed at the DW for an organization (due to big data), as most of the tutorials/resources I learn from only talk about Hadoop at the source but not at the target. How does the introduction of big data impact an organization's predefined reports and ad-hoc analysis, given this high volume of data?
Appreciate your response
That is a very broad question, but I'll try to give some answers.
Hadoop can be a data source, a data warehouse, or a "data lake", being a repository of data from which warehouses and marts may be drawn.
The line between Hadoop and RDBMS-based data warehouses is increasingly blurred. As SQL-on-Hadoop becomes a reality, interacting with Hadoop-based data becomes increasingly easy. To be effective, though, there must be structure in the data.
Some examples of Hadoop/DW interactions:
Microsoft Application Platform System, with Polybase interaction between SQL Server and Hadoop
Impala (Cloudera), Stinger (Hortonworks) and others providing SQL-on-Hadoop
Actian and Vertica (HP) providing RDBMS-compatible MPP on Hadoop
That said, the Hadoop DW is still immature. It is not quite as performant as an RDBMS-based DW, lacks many security and operational features, and is also lacking in SQL capability. Think carefully about your needs before taking this path.
Another question you should ask is whether you actually need a platform of this type. Any RDBMS can handle 3-5 TB of data. SQL Server and PostgreSQL are two examples of platforms that would handle a DW of that size on commodity hardware, with negligible administration.
Those same RDBMSs can handle 100 TB workloads, but they require much more care and feeding at that scale.
MPP RDBMS appliances handle data workloads into the Petabyte range, with lower administrative and operational overhead as they scale. I doubt you get to that scale, very few companies do :) You might choose an MPP appliance for a much smaller data volume, if speed of complex queries was your most important factor. I've seen MPP appliances deployed on data volumes as small as 5Tb for this reason.
Depending on the load technique, you will probably find that an RDBMS-based DW is faster to load than Hadoop. For example, I load hundreds of thousands of rows per second into PostgreSQL, and slightly less than that into SQL Server. It takes substantially longer to achieve the same result in Hadoop as I have to ingest the file, establish it in Hive, and move it to Parquet to get a similar level of output performance. Over time I expect this to change in Hadoop's favour, but it isn't quite there, yet.
You mentioned Dimensional Modelling. If your star schema is comprised of transactional fact tables and SCD0-SCD1 dimensions, thus needing insert-only processing, you might have success with SQL-on-Hadoop. If you need to update the facts (accumulating snapshots) or dimensions (SCD2, SCD3) you might struggle with both capability and performance - a lot of implementations don't yet support UPDATE queries, and those that do are slow.
Sorry that there isn't a simple "Do this!" answer, but this is a complex topic in an immature field. I hope these comments help your thinking.
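To make the load-path comparison above concrete, here is a hedged sketch of both routes in Python: a single bulk COPY into PostgreSQL versus the ingest-then-convert flow on Hadoop via Spark. The file names, table names, and paths are illustrative, and a pure Hive approach would use LOAD DATA plus an INSERT into an ORC/Parquet table instead.

```python
# Sketch of the two load paths compared above (names and paths are illustrative).
import psycopg2
from pyspark.sql import SparkSession

# RDBMS path: a single bulk COPY into PostgreSQL.
conn = psycopg2.connect("dbname=dw")
with conn, conn.cursor() as cur, open("orders.csv") as f:
    cur.copy_expert("COPY orders FROM STDIN WITH (FORMAT csv, HEADER true)", f)

# Hadoop path: ingest the raw file, expose it to SQL, then rewrite it as Parquet.
spark = SparkSession.builder.appName("dw-load").enableHiveSupport().getOrCreate()
raw = spark.read.csv("hdfs:///landing/orders.csv", header=True, inferSchema=True)
raw.write.mode("overwrite").format("parquet").saveAsTable("dw.orders")
```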
The processes for a data lake and a data warehouse are not the same. Dimensional modeling in the traditional sense starts with business process identification and star schema design, whereas in a data lake you do not commit to any assumption about the business process. The data lake collects the data at as granular a level as possible; you then explore it and find the business process. You can read more about data lakes in An Introduction to enterprise data lake - The myths and miracles.

Is Hadoop a good candidate for use as a key-value store?

Question
Would Hadoop be a good candidate for the following use case:
Simple key-value store (primarily needs to GET and SET by key)
Very small "rows" (32-byte key-value pairs)
Heavy deletes
Heavy writes
On the order of a 100 million to 1 billion key-value pairs
Majority of data can be contained on SSDs (solid state drives) instead of in RAM.
More info
The reason I ask is because I keep seeing references to the Hadoop file system and how Hadoop is used as the foundation for a lot of other database implementations that aren't necessarily designed for Map-Reduce.
Currently, we are storing this data in Redis. Redis performs great, but since it keeps all of its data in RAM, we have to use expensive machines with upwards of 128 GB of RAM. It would be nice to instead use a system that relies on SSDs; that way we would have the freedom to build much bigger hash tables.
We have also stored this data using Cassandra, but Cassandra tends to "break" if the deletes become too heavy.
Hadoop (contrary to popular media opinion) is not a database. What you describe is a database, so Hadoop is not a good candidate for you. Also, the post below is opinionated, so feel free to prove me wrong with benchmarks.
If you care about "NoSql DB's" that are on top of Hadoop:
HBase would be suited for heavy writes, but sucks on huge deletes
Cassandra same story, but writes are not as fast as in HBase
Accumulo might be useful for very frequent updates, but will suck on deletes as well
None of them makes "real" use of SSDs; I don't think any of them gets a huge speedup from them.
All of them suffer from costly compactions if you start to fragment your tablets (in BigTable terminology), so deleting is a fairly obvious limiting factor.
What you can do to mitigate the deletion issue is to just overwrite with a constant "deleted" value, which works around the compaction. However, this grows your table, which can be costly on SSDs as well. You will also need to filter those markers out on reads, which likely affects read latency.
From what you describe, Amazon's DynamoDB architecture sounds like the best candidate here. Deletes here are also costly, though maybe not as costly as in the above alternatives.
BTW: the recommended way of deleting lots of rows from the tables in any of the above databases is to just completely delete the table. If you can fit your design into this paradigm, any of those will do.
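A hedged sketch of the "overwrite instead of delete" work-around mentioned above, using the happybase HBase client; the table name, column family, and sentinel value are illustrative assumptions.

```python
# Sketch: mark rows as deleted instead of issuing real deletes,
# so compaction doesn't have to churn through tombstones.
import happybase

DELETED = b"__deleted__"          # illustrative sentinel value

connection = happybase.Connection("localhost")
table = connection.table("kv")

def soft_delete(key: bytes) -> None:
    # Overwrite the value with the sentinel rather than calling table.delete().
    table.put(key, {b"cf:value": DELETED})

def get(key: bytes):
    # Filter out soft-deleted entries at read time (costs a little latency).
    value = table.row(key).get(b"cf:value")
    return None if value in (None, DELETED) else value
```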
Although this isn't an answer to your question, in the context of what you say about
It would be nice to instead use a system that relies on SSDs. This way
we would have the freedom to build much bigger hash tables.
you might consider taking a look at Project Voldemort.
Speaking specifically as a Cassandra user, I know what you mean: it's the compaction and the tombstones that are the problem. I have myself run into TombstoneOverwhelmingException a couple of times and hit dead ends.
You might want to have a look at this article by LinkedIn.
It says:
Memcached is all in memory so you need to squeeze all your data into
memory to be able to serve it (which can be an expensive proposition
if the generated data set is large).
And finally
all we do is just mmap the entire data set into the process address
space and access it there. This provides the lowest overhead caching
possible, and makes use of the very efficient lookup structures in the
operating system.
I don't know if this fits your case, but you could consider evaluating Voldemort! Best of luck.
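To illustrate the mmap idea in that quote, here is a tiny hedged Python sketch that maps a read-only store file into the process address space and lets the OS page cache do the caching; the file name and the fixed-width record layout are invented for the example.

```python
# Sketch: serve lookups straight from an mmap'd read-only file,
# letting the OS page cache act as the cache (as in the quoted approach).
import mmap

RECORD_SIZE = 32  # e.g. 16-byte key + 16-byte value per record (illustrative)

with open("store.dat", "rb") as f:
    data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def get(index: int) -> bytes:
    # Reading a slice touches only the pages holding this record;
    # the kernel keeps hot pages resident and leaves cold ones on SSD.
    offset = index * RECORD_SIZE
    return data[offset + 16 : offset + RECORD_SIZE]
```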
