Do I need to be able to fit my entire database in memory to use Oracle's Database In-Memory?
No, you can selectively declare a subset of your database to be in-memory. Because Database In-Memory is targeted at analytic workloads, it populates the selected objects into an in-memory area in columnar format, which lets analytic queries scan the data much faster than they could in row format.
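As a sketch of how selective population works: objects opt in per table or even per partition. The table and partition names below are illustrative, and the statements assume an in-memory area has already been configured in the instance.

```python
# Illustrative Oracle Database In-Memory DDL; table/partition names are
# hypothetical. These strings would be executed through any Oracle client.
enable_table = "ALTER TABLE sales INMEMORY PRIORITY HIGH"          # whole table
enable_partition = "ALTER TABLE orders MODIFY PARTITION orders_q1 INMEMORY"  # one partition
disable_table = "ALTER TABLE archive NO INMEMORY"                  # keep cold data out

for stmt in (enable_table, enable_partition, disable_table):
    print(stmt)
```

Only the objects marked INMEMORY consume space in the in-memory area; everything else stays on disk in row format.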
I want to add a memory cache layer to my application. To populate the cache, I have to read data from a large Cassandra table. A full-table select is not recommended, because without partition keys it is a slow read operation. Alternatively, I can "predict" the partition keys from another Cassandra table, which I would also have to read in full, although it is a comparatively small table. After reading that user table I would have a list of potential partition keys (userX, userY) that may or may not be present in the initial table, and with that list I could populate the cache by executing a select query for each potential key. That doesn't sound like a great idea either.
So the question is: how do I properly populate a cache layer with data from a Cassandra DB?
The second option is preferred for warming up or pre-loading your cache.
Single-partition asynchronous queries issued from multiple client/app instances are much better than a full table scan. Asynchronous queries from many clients distribute the load efficiently across all nodes in the cluster, which is why they perform better.
It should be said that if you've got your data model right and you've sized your cluster correctly, you can achieve single-digit-millisecond latencies. I work with a lot of large organisations that have a 95% SLA of 6-8 ms for reads. Cheers!
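The fan-out pattern the answer describes can be sketched as follows. This is a stdlib-only illustration: `fetch_partition` is a stand-in for a real single-partition read (with the DataStax Python driver it would be `session.execute_async(...)` with the candidate key bound to the partition-key column); the table and key names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a single-partition Cassandra read. In real code this would be
# an async query against the large table, e.g. one execute_async per key.
def fetch_partition(key):
    return {"user_id": key, "payload": f"row-for-{key}"}

def warm_cache(candidate_keys, cache, max_in_flight=32):
    # Fan out one single-partition query per candidate key. Keys that do not
    # exist in the source table would simply come back with no rows.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for key, row in zip(candidate_keys, pool.map(fetch_partition, candidate_keys)):
            if row is not None:
                cache[key] = row

cache = {}
warm_cache(["userX", "userY"], cache)
print(sorted(cache))  # the two warmed keys
```

Bounding the number of in-flight requests (here `max_in_flight`) matters in practice so the warm-up does not overwhelm the cluster.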
I am new to big data and am trying to understand the various ways of persisting and retrieving data.
I understand that both Parquet and HBase are column-oriented storage formats, but Parquet is a file-oriented storage format and not a database, unlike HBase.
My questions are :
What is the use case for using Parquet instead of HBase?
Is there a use case where Parquet can be used together with HBase?
When performing joins, will Parquet be more performant than HBase (say, accessed through a SQL skin like Phoenix)?
As you already said in the question, Parquet is just storage, while HBase is storage (HDFS) plus a query engine (API/shell). So a fair comparison is between Parquet+Impala/Hive/Spark and HBase. Below are the key differences:
1) Disk space - Parquet takes less disk space than HBase; Parquet's encodings save more space than block compression in HBase.
2) Data ingestion - Data ingestion into Parquet is more efficient than into HBase. A simple reason follows from point 1: with Parquet, less data needs to be written to disk.
3) Record lookup by key - HBase is faster, as it is a key-value store, while Parquet is not. Indexing in Parquet will be supported in a future release.
4) Filters and other scan queries - Since Parquet stores more information about the records in a row group, it can skip a lot of records while scanning the data. This is why it is faster than HBase for such queries.
5) Updating records - HBase supports record updates, while updates are problematic in Parquet because the Parquet files need to be rewritten. Careful schema design and partitioning may improve updates, but it is not comparable to HBase.
Comparing the features above, HBase seems more suitable for situations where updates are required and queries mainly involve key-value lookups. Queries involving key-range scans will also perform better in HBase.
Parquet is suitable for use cases with very few updates and queries involving filters, joins, and aggregations.
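Point 4 comes from the per-row-group statistics (such as min/max values) that Parquet keeps: a reader can skip an entire row group whose value range cannot match the filter. A pure-Python sketch of that idea, not actual Parquet code:

```python
# Parquet keeps min/max statistics per row group; a reader skips whole row
# groups whose range cannot match the filter. Illustrative sketch only.
row_groups = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
stats = [(min(g), max(g)) for g in row_groups]  # per-group (min, max)

def scan(predicate_lo, predicate_hi):
    hits, groups_read = [], 0
    for (lo, hi), group in zip(stats, row_groups):
        if hi < predicate_lo or lo > predicate_hi:
            continue  # entire row group skipped without reading it
        groups_read += 1
        hits.extend(v for v in group if predicate_lo <= v <= predicate_hi)
    return hits, groups_read

hits, groups_read = scan(150, 160)
print(len(hits), groups_read)  # 11 matching values, only 1 of 3 groups read
```

A key lookup in HBase, by contrast, goes straight to the region holding the key, which is why point 3 favours HBase.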
Our product produces extracts from our database, which can be 300 GB+ in file format. To achieve that we join multiple large tables (close to 1 TB in size in some cases). We do not aggregate the data at all; these are pure extracts. How does Greenplum handle these kinds of large data sets? (The join keys are 3+ column keys, and not every table joins on the same keys; the only common key is the first one, and if the data were distributed by that key there would be a lot of skew, since the data itself is not balanced.)
You should use writable external tables for these types of large data extracts, because they can leverage gpfdist and write data in parallel. It will be very fast.
https://gpdb.docs.pivotal.io/510/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html
Also, your use case doesn't really indicate skew. Skew would be either storage skew from distributing the data by a poor column choice like gender_code, or processing skew where you filter by a column or columns for which only a few segments have the data.
In general, Greenplum Database handles this kind of load just fine. The query is executed in parallel on the segments.
Your bottleneck is likely the final export from the database: if you use SQL (or COPY), everything has to go through the master to the client, which takes time and is slow.
As Jon pointed out, consider using an external table and writing out the data as it comes out of the query. Also avoid any kind of sort operation in your query, if possible; it is unnecessary, because the data arrives unsorted in the external table file anyway.
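A sketch of the approach both answers recommend (hostnames, ports, and table/column names here are made up): each segment streams its slice of the join result to gpfdist in parallel instead of funnelling everything through the master.

```python
# Greenplum DDL/SQL for a parallel extract via gpfdist; all names are
# hypothetical. These would be run through any PostgreSQL-compatible client.
create_ext = """
CREATE WRITABLE EXTERNAL TABLE ext_extract (LIKE big_extract_view)
LOCATION ('gpfdist://etl-host:8081/extract.out')
FORMAT 'TEXT' (DELIMITER '|')
"""

# Each segment writes its own part of the join result directly to gpfdist.
run_extract = """
INSERT INTO ext_extract
SELECT a.k1, a.k2, a.k3, b.payload
FROM   big_table_a a JOIN big_table_b b
       ON  a.k1 = b.k1 AND a.k2 = b.k2 AND a.k3 = b.k3
"""

print(create_ext.strip().splitlines()[0])
```

A gpfdist process must be running on the target host (here `etl-host:8081`) before the INSERT is executed.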
I need to compare indexing in Oracle vs. Hadoop (Hive). So far, I have found two major indexing techniques in Hive: COMPACT and BITMAP indexing. I was able to measure the performance difference of COMPACT indexing in Hive compared to Oracle. I would like to understand more use cases/scenarios for BITMAP indexing in Hive. I also need to know whether Hive supports reverse-key indexes and ascending/descending indexes like Oracle does.
Yes, there are significant advantages to using indexes in Hive over Oracle, keeping in mind that Hive is suited to large data sets, and there is ongoing work on making Hive a real-time data-warehousing tool.
One use case where BITMAP indexing helps is a table whose columns have few distinct values; it should also be a large table (you will get better results with large tables, so do not test with small ones).
As of now, Hive supports only two techniques for explicitly created indexes: COMPACT and BITMAP.
Also, indexes in Hive are generally not recommended (although you can create them for your use case); the reason is the ORC format. ORC has built-in indexes that let the reader skip blocks of data during a read, and it also supports Bloom filter indexes. Together these pretty much replicate what Hive indexes did, and they do it automatically inside the data format, without the need to manage an external table (which is essentially what a Hive index is).
I would suggest you rather spend your time properly setting up your ORC tables.
Also read this great post about Hive indexing.
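For example, the ORC setup recommended above might look like this; the table and column names are illustrative, and the two TBLPROPERTIES keys enable ORC's built-in row-group indexes and a Bloom filter on the lookup column.

```python
# Hive DDL for an ORC table with built-in indexes and a Bloom filter; all
# names are hypothetical. This would be run through beeline or any Hive client.
create_orc = """
CREATE TABLE events (
  event_id   BIGINT,
  user_id    STRING,
  event_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  'orc.create.index' = 'true',
  'orc.bloom.filter.columns' = 'user_id'
)
"""
print(create_orc.strip().splitlines()[0])
```

With this in place, point lookups on `user_id` can skip most row groups via the Bloom filter, with no separately managed index table.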
Hive is a data-warehousing tool that runs on Hadoop, with built-in MapReduce capability for Hive queries. The metadata is kept separately from the actual data (by default in Apache Derby), so the burden on the database is very low. Hive processes large tables easily because of its distributed nature. You can also compare the inner-join performance of Oracle and Hive; on large data sets, Hive will give you better performance.
Assume that Oracle Coherence is free :)
Which one do you prefer?
What are the architectural and feature capability differences between Oracle Coherence(Tangosol) and Cassandra?
Best Regards
Oracle Coherence is a pure in-memory cache that can be distributed across nodes. Depending on its configuration it can offer strong consistency or eventual consistency for inserts and updates. Coherence is object-based, with a consistent data model.
Since you buy Coherence from Oracle, you can get commercial support from Oracle.
Cassandra is a Bigtable-style data store that is distributed across nodes, with no single point of failure. Its implementation of Bigtable uses some caching to improve performance before committing the data to disk. Cassandra requires some structure in its tuples (key/value/timestamp) but otherwise supports flexible data structures.
Preferences should be determined by your use case. They are both pretty cool in their own right.
You might also want to check out
- Terracotta in the in-memory space
- CouchDB and HBase as other players in the big table space.
Let's not forget GemFire from GemStone Systems, now owned by VMware (http://www.vmware.com/products/vfabric-gemfire/overview.html). GemFire is an in-memory distributed data fabric similar to Coherence and Terracotta, but different in certain key ways. Each has its pros and cons, but GemFire has lately been getting more support in a Spring sub-project called spring-gemfire.
Both are NoSQL databases. There are currently three main types of NoSQL databases: key-value stores, tabular stores, and document-oriented stores. Coherence is a key-value store, Cassandra is more like a tabular store, and MongoDB is a document-oriented NoSQL DB.