Generating a star schema in Hive - Hadoop

I come from the SQL data warehouse world, where I generate dimension and fact tables from a flat feed. In general data warehouse projects we divide the feed into fact and dimension tables.
I am completely new to Hadoop and I have learned that I can build a data warehouse in Hive. I am familiar with using GUIDs, which I think are applicable as primary keys in Hive. So, is the strategy below the right way to load fact and dimension tables in Hive?
Load source data into a Hive table; let's say Sales_Data_Warehouse
Generate dimensions from Sales_Data_Warehouse, e.g.:
SELECT New_Guid(), Customer_Name, Customer_Address From Sales_Data_Warehouse
When all dimensions are done then load the fact table like
SELECT New_Guid() AS Fact_Key, Customer.Customer_Key, Store.Store_Key...
FROM Sales_Data_Warehouse AS source
JOIN Customer_Dimension AS Customer ON source.Customer_Name =
Customer.Customer_Name AND source.Customer_Address = Customer.Customer_Address
JOIN Store_Dimension AS Store ON
Store.Store_Name = source.Store_Name
JOIN Product_Dimension AS Product ON .....
Is this the way I should load my fact and dimension tables in Hive?
Also, in general warehouse projects we need to update dimension attributes (e.g. Customer_Address changes to something else) or have to update a fact table foreign key (rarely, but it does happen). So, how can I have an INSERT-UPDATE load in Hive (like we do with Lookup in SSIS or a MERGE statement in T-SQL)?

We still get the benefits of dimensional models on Hadoop and Hive. However, some features of Hadoop require us to slightly adapt the standard approach to dimensional modelling.
The Hadoop File System is immutable. We can only add but not update data. As a result we can only append records to dimension tables (while Hive has added an update feature and transactions, this still seems to be rather buggy). Slowly Changing Dimensions on Hadoop become the default behaviour. In order to get the latest and most up-to-date record in a dimension table we have three options. First, we can create a view that retrieves the latest record using windowing functions. Second, we can have a compaction service running in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. HBase, and federate queries across the two types of storage.
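As a minimal sketch of the first option (table and column names here are assumptions, not from the question): a view that exposes only the latest version of each customer from an append-only dimension table, using a windowing function.

CREATE VIEW customer_dim_current AS
SELECT customer_key, customer_id, customer_name, customer_address
FROM (
  SELECT d.*,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY load_ts DESC) AS rn
  FROM customer_dim d
) t
WHERE rn = 1;   -- keep only the newest record per natural key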
The way data is distributed across HDFS makes it expensive to join data. In a distributed relational database (MPP) we can co-locate records with the same primary and foreign keys on the same node in a cluster. This makes it relatively cheap to join very large tables. No data needs to travel across the network to perform the join. This is very different on Hadoop and HDFS. On HDFS, tables are split into big chunks and distributed across the nodes in our cluster. We don’t have any control over how individual records and their keys are spread across the cluster. As a result, joins between two very large tables on Hadoop are quite expensive, as data has to travel across the network. We should avoid joins where possible. For a large fact and dimension table we can de-normalise the dimension table directly into the fact table. For two very large transaction tables we can nest the records of the child table inside the parent table and flatten out the data at run time. We can use SQL extensions such as array_agg in BigQuery/Postgres etc. to handle multiple grains in a fact table.
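As a rough illustration of de-normalising a dimension into the fact table (all table and column names are made up for the sketch), the join is paid once at load time instead of at query time:

CREATE TABLE sales_fact_denorm STORED AS ORC AS
SELECT f.sale_id,
       f.sale_amount,
       s.store_name,      -- dimension attributes copied onto each fact row
       s.store_region
FROM sales_fact f
JOIN store_dim s ON f.store_key = s.store_key;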
I would also question the usefulness of surrogate keys. Why not use the natural key? Performance with complex compound keys may be an issue, but otherwise surrogate keys are not really useful and I never use them.

Related

Hive Managed vs External tables maintainability

Which one is better (performance-wise and operationally in the long run) for maintaining loaded data: managed or external?
And by maintaining, I mean that these tables will frequently have the following operations on a daily basis:
Select using partitions most of the time, but for some queries partitions are not used.
Delete specific records, not the whole partition (for example, a problem is found in some columns and those records need to be deleted and inserted again). I am not sure this is supported for normal tables unless transactional tables are used.
Most important, the need to merge files frequently, maybe twice a day, to combine small files and get fewer mappers. I know CONCATENATE is available on managed tables and INSERT OVERWRITE on external ones; which one costs less?
It depends on your use case. External tables are recommended when the data is used across multiple applications, for example when Pig or another application processes the data alongside Hive. They are used when you are mainly reading data.
In the case of managed tables, Hive has complete control over the data. You can convert a managed table to external and vice versa:
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
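The reverse direction (a sketch with the same placeholder table name) turns an external table back into a managed one:

ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');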
As in your case you are doing frequent modifications to the data, it is better that Hive has total control over it. In this scenario it is recommended to use managed tables.
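As a sketch of the two file-merging options from the question (table, partition and column names are assumptions): CONCATENATE merges small files in place for managed ORC/RCFile tables, while INSERT OVERWRITE rewrites a partition from itself and works on either kind of table.

ALTER TABLE sales PARTITION (load_date='2023-01-01') CONCATENATE;

INSERT OVERWRITE TABLE sales PARTITION (load_date='2023-01-01')
SELECT sale_id, amount   -- list all non-partition columns here
FROM sales
WHERE load_date='2023-01-01';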
Apart from that, managed tables are more secure than external tables, because external tables can be accessed by anyone. With managed tables you can implement Hive-level security, which provides better control, but in the case of external tables you will have to implement HDFS-level security.
You can refer to the link below, which gives a few pointers to consider:
External Vs Managed tables comparison

Primary keys and indexes in Hive query language - possible or not?

We are trying to migrate Oracle tables to Hive and process them.
Currently the tables in Oracle have primary key, foreign key and unique key constraints.
Can we replicate the same in hive?
We are doing some analysis on how to implement it.
Hive indexing was introduced in Hive 0.7.0 (HIVE-417) and removed in Hive 3.0 (HIVE-18448); please read the comments in that Jira. The feature was completely useless in Hive: these indexes were too expensive for big data. RIP.
As of Hive 2.1.0 (HIVE-13290), Hive includes support for non-validated primary and foreign key constraints. These constraints are not validated; an upstream system needs to ensure data integrity before it is loaded into Hive. The constraints are useful for tools that generate ER diagrams and queries, and they also serve as self-documentation: you can easily find out what is supposed to be the PK if the table declares such a constraint.
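For illustration, a minimal sketch of the DDL (Hive 2.1.0+ syntax; table and column names are assumptions): constraints are declared with DISABLE NOVALIDATE, i.e. neither enforced nor validated.

CREATE TABLE customer_dim (
  customer_key BIGINT,
  customer_name STRING,
  PRIMARY KEY (customer_key) DISABLE NOVALIDATE
);

CREATE TABLE sales_fact (
  sale_id BIGINT,
  customer_key BIGINT,
  CONSTRAINT fk_customer FOREIGN KEY (customer_key)
    REFERENCES customer_dim (customer_key) DISABLE NOVALIDATE
);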
In Oracle, unique, PK and FK constraints are backed by indexes, so they work fast and are really useful. But that is not how Hive works, nor what it was designed for.
A quite normal scenario is that you have loaded a very big file with semi-structured data into HDFS. Building an index on it is too expensive, and without an index the only way to check for a PK violation is to scan all the data. Normally you cannot enforce constraints in big data. An upstream process can take care of data integrity and consistency, but this does not guarantee that you will not end up with a PK violation in Hive in some big table loaded from different sources.
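Without an index, such a check is a full scan; a sketch of the kind of query involved (names assumed):

SELECT customer_key, COUNT(*) AS cnt
FROM customer_dim
GROUP BY customer_key
HAVING COUNT(*) > 1;   -- any row returned is a PK violation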
Some file storage formats like ORC have internal lightweight "indexes" to speed up filtering and enable predicate push-down (PPD), but no PK and FK constraints are implemented using such indexes. This cannot be done because you can normally have many such files belonging to the same table in Hive, and the files can even have different schemas. Hive was created for petabytes, and you can process petabytes in a single run; data can be semi-structured and files can have different schemas. Hadoop does not support random writes, which adds more complications and cost if you want to rebuild indexes.
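For example (a sketch, all names assumed), ORC keeps min/max statistics per stripe automatically, and bloom filters can be requested per column to improve predicate push-down; none of this enforces uniqueness:

CREATE TABLE sales_orc (
  sale_id BIGINT,
  customer_id BIGINT,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns'='customer_id');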

Oracle 11g - Building a Type 2 SCD based on existing historical data in a relational model

I'm an ETL developer that's currently being tasked with developing a type 2 SCD from existing historical data in a relational database. I'm perfectly capable of creating a type 2 SCD that's responsible for tracking future changes to the data, but I'm completely useless when it comes to the task at hand.
The relational model is in our ODS. Based on that relational model, I'm supposed to build flat records in our DW dimension. There are multiple attributes which need to be monitored for changes, each in specific related tables in the relational model. Historical changes must be kept on a daily basis, and if multiple changes to the same attribute occur on the same day, only the last subsists.
How can I tackle this? I'm lost. Thanks in advance.
P.S. we're talking tables with 20-30 million rows and multiple attributes that may change at any given time and therefore must result in a new record in the SCD.
This will indeed be painful. I'm assuming from your question that the tables containing the attribute values are currently varying independently (or you wouldn't need to ask the question).
If you have a table 'Table1' containing 'Key', 'Attribute1' and 'Effective From','Effective To' columns, then you can 'explode' that table into a virtual table in the form 'Key','Attribute1','Date', projecting out one row for every date where that attribute was current.
(Note that you probably don't want to do this as a ranged join against your date dimension, because this will be a Triangular Join (ie perform really badly), you probably need to explode the rows in an ETL tool/programmatically)
If you perform this process across multiple tables, you will have a set of tables giving you the full day-by-day snapshot of each attribute for every day that you care about. It's then fairly easy to join those tables based on 'FK' and 'Date' to give you the complete daily snapshot across all of the attribute values.
Then, of course, you need to run this through another process to collapse rows with the same Key, contiguous dates and all the same attribute values, ie 'unexplode' the rows back into 'effective from','effective to' form. Note again that this is fundamentally a row-by-row operation (or at the very least a windowing function), and a set-based approach will perform very badly. Personally I'd just stream it all through some .net/java code to achieve this.
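As a sketch of the windowing-function variant of that 'unexplode' step (column names are assumptions), the classic trick is to tag contiguous runs of identical attribute values with the difference of two row numbers and then take the min/max date per run:

SELECT dim_key,
       attribute1,
       MIN(snapshot_date) AS effective_from,
       MAX(snapshot_date) AS effective_to
FROM (
  SELECT dim_key, attribute1, snapshot_date,
         ROW_NUMBER() OVER (PARTITION BY dim_key ORDER BY snapshot_date)
       - ROW_NUMBER() OVER (PARTITION BY dim_key, attribute1 ORDER BY snapshot_date) AS grp
  FROM daily_snapshot   -- the exploded day-by-day table
) s
GROUP BY dim_key, attribute1, grp;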
Given data volumes this will take a while, but should be achievable.

Simulating a columnar store using cluster tables

I have a client that mostly runs calculations on a single column of many rows from a table (a different column each time), which is a classic workload for a columnar DB.
The problem is that he is using Oracle, so what I thought of doing was to build a bunch of cluster tables where each table has just one column besides the PK, and this way allow him to work in a pseudo-columnar model.
What are your thoughts on the subject?
Will it even work as expected, or am I just forcing the solution here?
Thanks,
Daniel
I didn't test it in the end, but I did achieve close to vertical (columnar) performance using a sorted hash cluster.
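For reference, a minimal sketch of what sorted hash cluster DDL looks like (all object names and sizing numbers here are assumptions, not the poster's actual schema):

CREATE CLUSTER measure_cluster (
  entity_id    NUMBER,
  measure_date DATE SORT
)
HASHKEYS 1000000
HASH IS entity_id
SIZE 256;

CREATE TABLE entity_measure (
  entity_id    NUMBER,
  measure_date DATE SORT,
  measure_val  NUMBER
)
CLUSTER measure_cluster (entity_id, measure_date);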
