Creating a delta live table for reading incremental data - azure-databricks

I have a table "x" which contains raw data in the bronze layer. I have another table "Y" in the silver layer which contains the transformed data. Now incremental data is arriving in table x, and I want to merge that incremental data from table x in the bronze layer into table Y in the silver layer.
So I want to build a Delta Live Table to do this.

First, explode the data from the bronze table into a temporary table, then flatten the data into another temporary table. Then load the data into the silver table using a merge operation.
Kindly go through the official documentation. I followed the same scenario with the sample code in my environment: I added a silver table and created a Delta Live Table, and it works fine for me without errors.
For more information, refer to this link; it has detailed information about Delta Live Tables with incremental loads.
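As a rough illustration only, a minimal Delta Live Tables SQL sketch of this pattern might look like the following, assuming the bronze table (called x_bronze here) is also declared in the pipeline and assuming hypothetical key and ordering columns (id, ingest_timestamp):
-- Target silver table, populated incrementally from the bronze source.
CREATE OR REFRESH STREAMING LIVE TABLE y_silver;
-- APPLY CHANGES performs the upsert of incremental bronze rows into the silver table.
APPLY CHANGES INTO LIVE.y_silver
FROM STREAM(LIVE.x_bronze)
KEYS (id)                      -- hypothetical business key
SEQUENCE BY ingest_timestamp   -- hypothetical ordering column
STORED AS SCD TYPE 1;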

Related

Data Readiness Check

Let's say there is a job A which executes a Python script to connect to Oracle, fetch the data from Table A, and load it into Snowflake once a day. Application A, which depends on Table A in Snowflake, can simply depend on the success of job A for further processing; this is easy.
But if the data movement is via replication (Change Data Capture from Oracle moves to S3 using GoldenGate, pipes push it into a stage, and a stream loads the target using a Task every few minutes), what is the best way to let Application A know that the data is ready? How do I check whether the data is ready? Is there something available in Oracle, like a table-level marker, that can be moved over to Snowflake? The tables in Oracle cannot be modified to add anything new, and marker rows cannot be added either; these options are impractical. But something that Oracle provides implicitly which can be moved over to Snowflake, or some SCN-like number at the table level that can be compared every few minutes, could be a solution. I am eager to hear about any approaches.
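To illustrate the "SCN-like number" idea from the question only as a sketch: all object names below are hypothetical, and this assumes a small control table can be maintained in Snowflake alongside the replicated data.
-- On the Oracle side, capture a coarse high-water mark for the table
-- (ORA_ROWSCN is an Oracle pseudocolumn; it is approximate unless ROWDEPENDENCIES is enabled).
SELECT MAX(ORA_ROWSCN) AS src_high_water FROM table_a;
-- Ship that value into a control table in Snowflake as part of the replication flow,
-- then let Application A poll it and compare it with the last value it processed.
SELECT src_high_water
FROM etl_control.replication_watermarks
WHERE table_name = 'TABLE_A';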

Advice on Setup

I started my first data analysis job a few months ago. I am in charge of a SQL database, and I take that data and create dashboards in Power BI. Our SQL database is replicated from an online web portal we use for data entry; we do not add data to the database ourselves, but instead the data is put into tables based on what is entered into the web portal. Since this database is replicated by another company, I created our own database that is connected via a linked server. I have built many views to pull only the needed data from the initial database (I did this to limit the amount of data sent to Power BI for performance). My view count is climbing, and I am wondering whether, in terms of performance, this is the best way forward. The highest row count of a view is 32,000 and the lowest is around 1,000 rows.
Some of the views that I am writing end up joining 5-6 tables together due to the structure built by the data web portal company that controls the database.
My suggestion would be to create a data warehouse schema (star schema), keeping as a principle one star schema per domain: for example, one for sales, one for subscriptions, one for purchases, etc. Use the logic of data marts.
Identify your dimensions and your facts and keep evolving that schema. You will find that you end up with far fewer tables.
Your data are not that big, so you can use whatever ETL strategy you like:
truncate-and-load or incremental.
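A minimal sketch of one such star, with hypothetical names and assuming SQL Server (since the source database is reached via a linked server):
-- One conformed dimension plus one fact table for a hypothetical "sales" data mart.
CREATE TABLE dim_customer (
    customer_key  INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    customer_id   VARCHAR(50),                    -- natural key from the web portal
    customer_name VARCHAR(200),
    city          VARCHAR(100)
);
CREATE TABLE fact_sales (
    sales_key     INT IDENTITY(1,1) PRIMARY KEY,
    customer_key  INT NOT NULL REFERENCES dim_customer(customer_key),
    sale_date     DATE,
    quantity      INT,
    amount        DECIMAL(18,2)
);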

Generating star schema in hive

I am from the SQL data warehouse world, where I generate dimension and fact tables from a flat feed. In general data warehouse projects we divide the feed into facts and dimensions. Ex:
I am completely new to Hadoop, and I have come to know that I can build a data warehouse in Hive. Now, I am familiar with using GUIDs, which I think are applicable as primary keys in Hive. So, is the strategy below the right way to load facts and dimensions in Hive?
Load the source data into a Hive table; let's say Sales_Data_Warehouse.
Generate the dimensions from Sales_Data_Warehouse; e.g.:
SELECT New_Guid(), Customer_Name, Customer_Address FROM Sales_Data_Warehouse
When all the dimensions are done, load the fact table like:
SELECT New_Guid() AS Fact_Key, Customer.Customer_Key, Store.Store_Key, ...
FROM Sales_Data_Warehouse AS source
JOIN Customer_Dimension AS Customer
  ON source.Customer_Name = Customer.Customer_Name
 AND source.Customer_Address = Customer.Customer_Address
JOIN Store_Dimension AS Store
  ON Store.Store_Name = source.Store_Name
JOIN Product_Dimension AS Product
  ON ...
Is this the way I should load my fact and dimension tables in Hive?
Also, in general warehouse projects we need to update dimension attributes (e.g., Customer_Address is changed to something else) or have to update a fact table's foreign key (rarely, but it does happen). So, how can I do an INSERT-UPDATE (upsert) load in Hive, like a Lookup in SSIS or a MERGE statement in T-SQL?
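For reference, Hive 2.2+ does provide a MERGE statement on transactional (ACID) tables; a minimal sketch, with a hypothetical staging table holding the changed rows:
-- Requires the target to be a transactional (ACID) Hive table.
MERGE INTO customer_dimension AS t
USING customer_updates AS s            -- hypothetical staging table with changed rows
ON t.customer_name = s.customer_name
WHEN MATCHED THEN
  UPDATE SET customer_address = s.customer_address
WHEN NOT MATCHED THEN
  INSERT VALUES (s.customer_key, s.customer_name, s.customer_address);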
We still get the benefits of dimensional models on Hadoop and Hive. However, some features of Hadoop require us to slightly adapt the standard approach to dimensional modelling.
The Hadoop File System is immutable: we can only add data, not update it. As a result we can only append records to dimension tables (while Hive has added an update feature and transactions, these seem to be rather buggy). Slowly Changing Dimensions on Hadoop therefore become the default behaviour. In order to get the latest and most up-to-date record in a dimension table we have three options. First, we can create a view that retrieves the latest record using windowing functions. Second, we can have a compaction service running in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. HBase, and federate queries across the two types of storage.
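A minimal sketch of the first option, assuming a hypothetical load_ts column that records when each dimension version was appended:
-- View that exposes only the latest version of each customer record.
CREATE VIEW customer_dimension_current AS
SELECT customer_key, customer_name, customer_address
FROM (
    SELECT customer_key, customer_name, customer_address,
           ROW_NUMBER() OVER (PARTITION BY customer_name
                              ORDER BY load_ts DESC) AS rn
    FROM customer_dimension
) latest
WHERE rn = 1;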
The way data is distributed across HDFS makes it expensive to join data. In a distributed relational database (MPP) we can co-locate records with the same primary and foreign keys on the same node in a cluster, which makes it relatively cheap to join very large tables: no data needs to travel across the network to perform the join. This is very different on Hadoop and HDFS. On HDFS, tables are split into big chunks and distributed across the nodes of the cluster. We don't have any control over how individual records and their keys are spread across the cluster. As a result, joins on Hadoop between two very large tables are quite expensive, as data has to travel across the network. We should avoid joins where possible. For a large fact and dimension table we can de-normalise the dimension table directly into the fact table. For two very large transaction tables we can nest the records of the child table inside the parent table and flatten out the data at run time. We can use SQL extensions such as array_agg in BigQuery/Postgres, etc., to handle multiple grains in a fact table.
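For example, nesting child records into the parent with BigQuery-style SQL might look like this (table and column names are hypothetical):
-- Collapse the child (order line) table into an array column at the parent grain.
SELECT order_id,
       ARRAY_AGG(STRUCT(product_id, quantity, unit_price)) AS line_items
FROM order_lines
GROUP BY order_id;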
I would also question the usefulness of surrogate keys: why not use the natural key? Performance with complex compound keys may be an issue, but otherwise surrogate keys are not really useful, and I never use them.

How to do table operations in Google BigQuery?

I wanted some advice on how to deal with table operations (renaming a column) in Google BigQuery.
Currently, I have a wrapper to do this. My tables are partitioned by date; e.g., if I have a table named fact, I will have several tables named:
fact_20160301
fact_20160302
fact_20160303... etc
My rename-column wrapper generates aliased queries, i.e., if I want to change my table schema from
['address', 'name', 'city'] -> ['location', 'firstname', 'town']
I do a batch query operation:
select address as location, name as firstname, city as town
and do a WRITE_TRUNCATE on the parent tables.
My main issue lies with the fact that BigQuery only supports 50 concurrent jobs. This means that when I submit my batch request, I can only do around 30 partitions at a time, since I'd like to reserve 20 slots for ETL jobs that are running.
Also, I haven't found a way to do a poll_job on a batch operation to see whether or not all jobs in the batch have completed.
If anyone has some tips or tricks, I'd love to hear them.
I can propose two options.
Using a view
View creation is very simple to script out and execute; it is fast and free compared with the cost of scanning the whole table with the select-into approach.
You can create a view using the Tables: insert API with the type property set appropriately (see the sketch below, after both options).
Using Jobs: insert EXTRACT and then LOAD
Here you can extract the table to GCS and then load it back into BigQuery with the adjusted schema.
The above approach will a) eliminate the cost of querying (scanning) the tables and b) can help with the job limits, though this depends on the actual volume of the tables and other requirements you might have.
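A minimal sketch of the view option for one date-sharded table, written here with BigQuery DDL for readability (the same SELECT body can be supplied through the Tables: insert API); the dataset name is hypothetical:
-- View exposing the renamed columns without rewriting the underlying shard.
CREATE VIEW mydataset.fact_20160301_renamed AS
SELECT address AS location,
       name    AS firstname,
       city    AS town
FROM mydataset.fact_20160301;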
The best way to manipulate a schema is through the Google BigQuery API.
Use the tables.get API to retrieve the existing schema for your table: https://cloud.google.com/bigquery/docs/reference/v2/tables/get
Manipulate your schema file, renaming columns, etc.
Again using the API, perform an update on the schema, setting it to your newly modified version. This should all occur in one job: https://cloud.google.com/bigquery/docs/reference/v2/tables/update

Oracle To Teradata Record Validation

Project Background:
I am part of a data migration project. Data is to be migrated from one platform (Oracle) to another (Teradata). My project requirement is that I have to compare the whole data of each table between these two databases. But my problem is that the client is not allowing us to create temp tables in the target database, so I am unable to use a MINUS query for full data validation on the target side.
When a table has less data, say fewer than 100,000 rows, it is easy for me to compare the table data using Excel (after importing into two different workbooks and then comparing with Excel's built-in compare tool or a macro). But when the row count is more than 100,000, say 100,000 or 200,000 rows, I am unable to use Excel for full data validation.
The project belongs to the banking sector, so the client is not allowing us to use any third-party tool for data comparison.
My question is: how do I validate full data between two different database platforms without using a MINUS query in a data migration project? Please help me with this scenario.
