I have a client that mostly uses calculations on a single column of many rows from a table (each time another column), which is classic for a columnar DB.
The problem is that he is using Oracle, so what I thought of doing was to build a bunch of cluster table where each table has just one column besides the PK and this way allow him to work in a pseudo-columnar model.
What are you thoughts on the subject?
Will it even work as expected or am I just forcing the solution here ?
Thanks,
Daniel
I didn't test it in the end but I did achieve close to vertical performance time using sorted table hash cluster.
Related
We are building a DB infrastructure on the top of Hadoop systems. We will be paying a vendor to do that and I do not think we are getting the right answers from the first vendor. So, I need the help from some experts to validate if I am right or I miss something
1. We have about 1600 fields in the data. A unique record is identified by those 1600 records
We want to be able to search records in a particular timeframe
(aka, records for a given time frame)
There are some fields that change overtime (monthly)
The vendor stated that the best way to go is HBASE and that they have to choices: (1) make the search optimize for machine learning (2) make adhoc queries.
The (1) will require a concatenate key with all the fields of interest. The key length will determine how slow or fast the search will run.
I do not think this is correct.
1. We do not need to use HBASE. We can use HIVE
2. We do not need to concatenate field names. We can translate those to a number and have a key as a number
3. I do not think we need to choose one or the other.
Could you let me know what you think about that?
It all depends on what is your use case. In simpler terms, Hive alone is not good when it comes to interactive queries however one of the best when it comes to analytics.
Hbase on the other hand, is really good for interactive queries, however doing analytics would not be that easy as hive.
We have about 1600 fields in the data. A unique record is identified by those 1600 records
HBase
Hbase is a NoSQL, columner database which stores information in Map(Dictionary) like format. Where each row needs to have one column which uniquely identifies the row. This is called key.
You can have key as a combination of multiple columns as well if you don't have a single column which can uniquely identifies the row. And then you can search record using partial key. However this is going to affect the performance ( as compared to have single column key).
Hive:
Hive has a SQL like language (HQL) to Query HDFS, which you can use for analytics. However, it doesn't require any primary key so you can insert duplicate records if required.
The vendor stated that the best way to go is HBASE and that they have
to choices: (1) make the search optimize for machine learning (2) make
adhoc queries. The (1) will require a concatenate key with all the
fields of interest. The key length will determine how slow or fast the
search will run.
In a way your vendor is correct, as I explained earlier.
We do not need to use HBASE. We can use HIVE 2. We do not need to concatenate field names. We can translate those to a number and have a key as a number 3. I do not think we need to choose one or the other.
Weather you can use HBASE or Hive is depends on your use case. However, if you are planning to use Hive then you don't even need to generate a pseudo key (row numbers you are talking about)
There is one more option if you have hortonworks deployment. Consider Hive for analytics and LLAP for interactive queries.
I am from SQL Datawarehouse world where from a flat feed I generate dimension and fact tables. In general data warehouse projects we divide feed into fact and dimension. Ex:
I am completely new to Hadoop and I came to know that I can build data warehouse in hive. Now, I am familiar with using guid which I think is applicable as a primary key in hive. So, the below strategy is the right way to load fact and dimension in hive?
Load source data into a hive table; let say Sales_Data_Warehouse
Generate Dimension from sales_data_warehouse; ex:
SELECT New_Guid(), Customer_Name, Customer_Address From Sales_Data_Warehouse
When all dimensions are done then load the fact table like
SELECT New_Guid() AS 'Fact_Key', Customer.Customer_Key, Store.Store_Key...
FROM Sales_Data_Warehouse AS 'source'
JOIN Customer_Dimension Customer on source.Customer_Name =
Customer.Customer_Name AND source.Customer_Address = Customer.Customer_Address
JOIN Store_Dimension AS 'Store' ON
Store.Store_Name = Source.Store_Name
JOIN Product_Dimension AS 'Product' ON .....
Is this the way I should load my fact and dimension table in hive?
Also, in general warehouse projects we need to update dimensions attributes (ex: Customer_Address is changed to something else) or have to update fact table foreign key (rarely, but it does happen). So, how can I have a INSERT-UPDATE load in hive. (Like we do Lookup in SSIS or MERGE Statement in TSQL)?
We still get the benefits of dimensional models on Hadoop and Hive. However, some features of Hadoop require us to slightly adopt the standard approach to dimensional modelling.
The Hadoop File System is immutable. We can only add but not update data. As a result we can only append records to dimension tables (While Hive has added an update feature and transactions this seems to be rather buggy). Slowly Changing Dimensions on Hadoop become the default behaviour. In order to get the latest and most up to date record in a dimension table we have three options. First, we can create a View that retrieves the latest record using windowing functions. Second, we can have a compaction service running in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. HBase and federate queries across the two types of storage.
The way how data is distributed across HDFS makes it expensive to join data. In a distributed relational database (MPP) we can co-locate records with the same primary and foreign keys on the same node in a cluster. This makes it relatively cheap to join very large tables. No data needs to travel across the network to perform the join. This is very different on Hadoop and HDFS. On HDFS tables are split into big chunks and distributed across the nodes on our cluster. We don’t have any control on how individual records and their keys are spread across the cluster. As a result joins on Hadoop for two very large tables are quite expensive as data has to travel across the network. We should avoid joins where possible. For a large fact and dimension table we can de-normalise the dimension table directly into the fact table. For two very large transaction tables we can nest the records of the child table inside the parent table and flatten out the data at run time. We can use SQL extensions such as array_agg in BigQuery/Postgres etc. to handle multiple grains in a fact table
I would also question the usefulness of surrogate keys. Why not use the natural key? Maybe performance for complex compound keys may be an issue but otherwise surrogate keys are not really useful and I never use them.
I am using golden gate to replicate table from one DB to multiple DB's. The challenging part is that in one DB the table should be replicated full (all table columns), but in the rest of the DBs the table needs to be half replicated, meaning just a few columns, not all.
Is it possible to have columns exception at the replication level ?
I know that is it possible at the extract level, but this doesn't fit to my scenario.
COLSEXCEPT is an EXTRACT parameter only. It cannot be used in replication.
For tables with large number of columns, using COLEXCEPT can help in excluding some columns instead of entering all the columns in the extract file.
You need to solve this in the REPLICAT side by mapping the necessary columns to the target table using COLMAP. I think USEDEFAULTS will not work in this case for REPLICAT since you mentioned that you need few columns only(Does that mean the table structure is different from SOURCE to TARGET???)
I am developing an enterprise application with an Oracle backend. I am designing a core part of the DB architecture now and im having some questions on it.
First and most important thing is, most of my tables needs to preserve old data. For example
Consider a table with the fields
Contract No, Contract Name, Contract Person, Contract Email
I have a records like
12, xxx, yyy, xxx#zzz.ccc
and some one modifies it to
12, xxx, zzz, xxx#zzz.ccc
at any point of time i need to display the new record while still have copy of the old record.
So what i thought was to put a duplicate record of the old data and update the fields that was changed and have a flag to keep track of active records with something like "is active" as 1.
The downside is that this creates redundancy in the table and seems like a bad design. But any other model seems unnecessarily complex and this seems cleaner to me. Also i dont see any performance issues having a duplicate record too. So please let me know if this is ok or am i missing something here.
Some times where there is a one to many relationship my assumption is to have a mapping table where i map the multiple entity in individual records by repeating master ID and changing child ID in each record. Is this a right way to do it or is there a better way to do it.
Is there a book on database best practices.
Thanks.
The database im dealing with is Oracle 11g on a two node RAC cluster
Also i dont see any performance issues having a duplicate record too.
Assume you have a row that, over time, has 15 updates to it. If you don't store any temporal data (if you don't store different versions of the row), you end up storing one row. If you do store temporal data, you end up storing 15 rows.
You also need more indexes, because the id number is no longer sufficient to identify a single row.
If you have only relatively small tables, you probably won't see any performance difference. (There will be one, but it probably won't be noticeable to users.) But a table that has 10 million rows will perform differently than a table that has 150 million rows. (15 versions per row, times 10 million rows.)
Some times where there is a one to many relationship my assumption is
to have a mapping table where i map the multiple entity in individual
records by repeating master ID and changing child ID in each record.
Is this a right way to do it or is there a better way to do it.
You probably need to know which child rows belong to which parent rows. So you need more than a single master id for the key. The master id alone doesn't tell you which version of that row in the parent table applies to a given child row.
Is there a book on database best practices.
There are books on temporal databases. The first one that I know of is Snodgrass's Developing Time-Oriented Database Applications in SQL. It's available in several formats, and it's free. It's also kind of old, but the information in it is important to understand if you're going to be building a temporal database. Also, think about reading Date's book Temporal Data and the Relational Model.
Wikipedia has an article that summarizes the ideas behind temporal databases.
Is normalization completely mandatory.
That's a meaningless question. You will have different issues with tables normalized to 2NF than you'll have with tables normalized to 5NF or 6NF.
I would keep the old/history records in a separate table. Create an upd/del trigger to populate your audit/history table for you, and keep only the most current data in your main table.
See here for an example. Many other similar examples exists in SO.
Is there a simple way to recalculate some values after an insert occurs? I have a table with multiple column families, one of them statistical. I'd like to insert the original record and than have some HBase-specific facility to calculate the values offline - without blocking the insert.
Let's assume I put some files into an hbase table and want to have information about the number of lines in them and also the dates stored there.
I've been looking into RegionObserver and its preGet method. This solution works but I'm afraid it blocks the actual insert from occurring until the calculation is completed.
use the postPut method. You can see a short introduction to HBase's coprocessors here
Try apache Pig, this is best suited for stasticl calculations and can run in local as well as mapred mode
For more detail you can visit
http://pig.apache.com