Project Background:
I am part of a data migration project. Data is to be migrated from one platform (Oracle) to another (Teradata). My project requirement is that I have to compare the complete data of each table between these two databases. My problem is that the client does not allow us to create temp tables in the target database, so I am unable to use a MINUS query for full data validation on the target side.
When a table has little data, say fewer than 100,000 rows, it is easy for me to compare the table data using Excel (after importing into two different Excel files and then comparing with Excel's built-in compare tool or a macro). But when the row count is more than 100,000 (say 200,000), I am unable to use Excel for full data validation.
The project belongs to the banking sector, so the client does not allow us to use any third-party tool for data comparison.
My question is: "How do I validate the full data between two different database platforms without using a MINUS query in a data migration project?" Please help me with this scenario.
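One possible way to approach this without temp tables or a MINUS query is to fetch the rows from each side and compare per-row hashes on the client. The following is only a minimal Python sketch, assuming every table has a primary key in the first column and that the rows have already been fetched with whatever database driver is permitted. The function names and the sample rows are hypothetical.

```python
import hashlib


def row_digest(row):
    """Hash one row after normalising every value to trimmed text.

    NULL is mapped to an empty string here, which is an assumption;
    adjust it if NULL and '' must be told apart.
    """
    normalised = "\x1f".join("" if v is None else str(v).strip() for v in row)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def table_digests(rows, key_index=0):
    """Map primary key -> row digest for the rows of one database."""
    return {row[key_index]: row_digest(row) for row in rows}


def compare_tables(source_rows, target_rows, key_index=0):
    """Report keys that are missing, extra, or different on the target."""
    src = table_digests(source_rows, key_index)
    tgt = table_digests(target_rows, key_index)
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "extra_in_target": sorted(tgt.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


# Hypothetical sample rows standing in for the two query results.
oracle_rows = [(1, "Alice", 100.0), (2, "Bob", 250.5), (3, "Carol", None)]
teradata_rows = [(1, "Alice", 100.0), (2, "Bob", 999.0), (4, "Dave", 10.0)]

result = compare_tables(oracle_rows, teradata_rows)
print(result)
```

For large tables the same idea can be applied in key ranges, so that only one range of rows is held in memory at a time.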
Related
I started my first data analysis job a few months ago. I am in charge of a SQL database, and I take that data and create dashboards within Power BI. Our SQL database is replicated from an online web portal we use for data entry. We do not add data to the database ourselves; instead, the data is put into tables based on what is entered into the web portal. Since this database is replicated via another company, I created our own database that is connected via a linked server. I have built many views to pull only the needed data from the initial database (I did this to limit the amount of data sent to Power BI, for performance). My view count is climbing, and I am wondering whether, in terms of performance, this is the best way forward. The highest row count of a view is 32,000 and the lowest is around 1,000 rows.
Some of the views that I am writing end up joining 5-6 tables together due to the structure built by the data web portal company that controls the database.
My suggestion would be to create a data warehouse schema (star schema), keeping as a principle one star schema per domain. For example, one for sales, one for subscriptions, one for purchases, etc. Use the logic of data marts.
Identify your dimensions and your facts and keep evolving that schema. You will find that you end up with far fewer tables.
Your data is not that big, so you can use whatever ETL strategy you like:
truncate-and-load or incremental.
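To make the two strategies concrete, here is a minimal Python sketch using an in-memory SQLite database as a stand-in for the warehouse. The table `dim_customer`, its columns, and the `updated_at` high-water mark are assumptions made for illustration and do not come from the question.

```python
import sqlite3


def truncate_load(conn, rows):
    """Full refresh: empty the target table and reload every source row."""
    with conn:
        # SQLite has no TRUNCATE, so DELETE is used in this sketch.
        conn.execute("DELETE FROM dim_customer")
        conn.executemany(
            "INSERT INTO dim_customer (id, name, updated_at) VALUES (?, ?, ?)", rows
        )


def incremental_load(conn, rows, last_loaded):
    """Upsert only the rows changed after the previous high-water mark."""
    changed = [r for r in rows if r[2] > last_loaded]
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO dim_customer (id, name, updated_at) VALUES (?, ?, ?)",
            changed,
        )
    # Return the new high-water mark for the next run.
    return max((r[2] for r in changed), default=last_loaded)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)

# First run: load everything.
truncate_load(conn, [(1, "Ann", "2024-01-01"), (2, "Ben", "2024-01-02")])

# Later run: one row changed and one row is new.
source = [
    (1, "Ann", "2024-01-01"),
    (2, "Benjamin", "2024-01-05"),
    (3, "Cy", "2024-01-06"),
]
mark = incremental_load(conn, source, "2024-01-02")
rows = conn.execute("SELECT id, name FROM dim_customer ORDER BY id").fetchall()
print(rows, mark)
```

With tables of a few tens of thousands of rows, the full refresh is usually the simpler choice; the incremental variant only pays off when a reliable change timestamp exists.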
I will explain my use case to help decide which DB extract utility to use.
I need to extract data from SQL Server tables with varying frequency each day. Each extract query is a complex SQL statement, involving 5-10 tables in joins etc. with multiple clauses. I have around 20-30 such statements overall.
All these extract queries might need to run multiple times a day, with frequencies that vary from day to day. It depends on how many times we receive data from the source system, among other cases.
We are planning to use Kafka to publish a message that lets the NiFi workflow know whenever an RDBMS table is updated and the flow needs to be triggered (I can't just trigger the NiFi flow based on an "incremental" column value, because there might be scenarios where all rows are updated and no new rows are created in the tables).
How should I go about designing my NiFi flow? There are ExecuteSQL/GenerateTableFetch/ExecuteSQLRecord/QueryDatabaseTable and all sorts of other components available. Which one is going to fit my requirement best?
Thanks!
I suggest that you use ExecuteSQL. You can set the query from an attribute or compose it using attributes. The easiest way is to create JSON, then parse that JSON and create attributes from it. Check this example, in which the SQL is created from a file; you can adjust it to create the SQL from Kafka instead (link).
Wanted some advice on how to deal with table operations (rename column) in Google BigQuery.
Currently, I have a wrapper to do this. My tables are partitioned by date, e.g., if I have a table named fact, I will have several tables named:
fact_20160301
fact_20160302
fact_20160303... etc
My rename-column wrapper generates aliased queries, i.e., if I want to change my table schema from
['address', 'name', 'city'] -> ['location', 'firstname', 'town']
I do a batch query operation:
select address as location, name as firstname, city as town
and do a WRITE_TRUNCATE on the parent tables.
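The query generation described above can be sketched in Python roughly as follows. This only builds the SQL strings for each date-suffixed table; submitting them to BigQuery is left out. The helper names `aliased_select` and `shard_names` are hypothetical, and the `from` clause without a dataset prefix is an assumption.

```python
from datetime import date, timedelta


def aliased_select(table, renames):
    """Build 'select old as new, ...' for one table from an old->new mapping."""
    columns = ", ".join(f"{old} as {new}" for old, new in renames.items())
    return f"select {columns} from {table}"


def shard_names(prefix, start, days):
    """Return date-suffixed table names such as fact_20160301."""
    return [f"{prefix}_{(start + timedelta(days=i)):%Y%m%d}" for i in range(days)]


renames = {"address": "location", "name": "firstname", "city": "town"}
tables = shard_names("fact", date(2016, 3, 1), 3)
queries = [aliased_select(table, renames) for table in tables]

for query in queries:
    print(query)
```

Each generated query would then be run with its own table as the destination and a WRITE_TRUNCATE write disposition, as in the wrapper described here.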
My main issue lies with the fact that BigQuery only supports 50 concurrent jobs. This means that when I submit my batch request, I can only do around 30 partitions at a time, since I'd like to reserve 20 spots for the ETL jobs that are running.
Also, I haven't found a way to do a poll_job on a batch operation to see whether or not all jobs in a batch have completed.
If anyone has some tips or tricks, I'd love to hear them.
I can propose two options
Using View
View creation is very simple to script out and execute; it is fast and free compared with the cost of scanning the whole table with the select-into approach.
You can create a view using the Tables: insert API with the type property set properly
Using Jobs: insert EXTRACT and then LOAD
Here you can extract table to GCS and then load it back to GBQ with adjusted schema
The above approach will a) eliminate the cost of querying (scanning) tables and b) can help with the limitations. But it might not, depending on the actual volume of the tables and other requirements you might have
The best way to manipulate a schema is through the Google BigQuery API.
Use the tables.get API to retrieve the existing schema for your table. https://cloud.google.com/bigquery/docs/reference/v2/tables/get
Manipulate your schema file, renaming columns etc.
Again using the API, perform an update on the schema, setting it to your newly modified version. This should all occur in one job. https://cloud.google.com/bigquery/docs/reference/v2/tables/update
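The middle step, editing the retrieved schema, could look roughly like the Python sketch below. It assumes the schema has the shape of the `schema` object in a tables.get response (a `fields` list of dictionaries with `name`, `type` and `mode`), handles only top-level fields, and makes no API calls at all. The helper name `rename_fields` and the sample schema are hypothetical.

```python
import copy


def rename_fields(schema, renames):
    """Return a copy of the schema with top-level field names replaced.

    Nested RECORD fields are not handled in this sketch.
    """
    updated = copy.deepcopy(schema)
    for field in updated["fields"]:
        field["name"] = renames.get(field["name"], field["name"])
    return updated


# Hypothetical schema in the shape returned by tables.get.
schema = {
    "fields": [
        {"name": "address", "type": "STRING", "mode": "NULLABLE"},
        {"name": "name", "type": "STRING", "mode": "NULLABLE"},
        {"name": "city", "type": "STRING", "mode": "NULLABLE"},
    ]
}

new_schema = rename_fields(
    schema, {"address": "location", "name": "firstname", "city": "town"}
)
print([field["name"] for field in new_schema["fields"]])
```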
I'm importing data from a text file using Bulk Insert in the script component of an SSIS package.
The package ran successfully and the data was imported into SQL.
Now how do I validate the completeness of the data?
1. I can get the row count from source and destination and compare them.
But my manager wants to know how we can verify that all the data has come across as it is, without any issues.
If we were comparing two tables, we could probably join them on all fields and see whether anything is missing.
I’m not sure how to compare a text file and a SQL table.
One way would be to write code to read the file line by line, query the database for that record, and compare each and every field. We have millions of records, so this is not going to be a simple task.
Is there any other way to validate all of the data?
Any suggestions would be much appreciated
Thanks
Ned
Well, you could take the same file, do a look-up against the SQL source, and if any of the columns don't match, route those rows to a row count.
Here's a generic example of how you can do this.
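The look-up idea can also be sketched outside SSIS. The Python below is only an illustration under stated assumptions: the file is comma-separated with the key in the first column, an in-memory SQLite table stands in for the SQL Server destination, all values are compared as text, and the names `validate_file_against_table` and `customers` are hypothetical.

```python
import csv
import io
import sqlite3


def validate_file_against_table(lines, conn, select_sql):
    """Look up every file record in the table and count the differences."""
    # Load the table once, keyed by the first column, with values as text.
    db_rows = {
        str(row[0]): tuple("" if v is None else str(v) for v in row)
        for row in conn.execute(select_sql)
    }
    report = {"matched": 0, "mismatched": [], "missing_in_db": []}
    for record in csv.reader(lines):
        if not record:
            continue  # skip blank lines
        key = record[0]
        if key not in db_rows:
            report["missing_in_db"].append(key)
        elif db_rows[key] != tuple(record):
            report["mismatched"].append(key)
        else:
            report["matched"] += 1
    return report


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)", [(1, "Ann", "Oslo"), (2, "Ben", "Rome")]
)

# Hypothetical file content; a real run would pass an open file object.
file_text = "1,Ann,Oslo\n2,Ben,Paris\n3,Cy,Lima\n"
report = validate_file_against_table(
    io.StringIO(file_text), conn, "SELECT id, name, city FROM customers"
)
print(report)
```

Loading the table into a dictionary once avoids one query per file line; for millions of records the same comparison can be done in key ranges to limit memory use.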
I am new to Oracle. Since we have rewritten an earlier application, we have to migrate the data from the earlier database in Oracle 9i to a new database, also in 9i, with totally different structures. The column names and types would be totally different. We need to map the tables and columns, try to export as much data as possible, eliminate duplicates, and fill empty values with defaults.
Are there any tools which can help in mapping the elements of the two databases, with rules to handle duplicates and default values, and migrate the data?
Thanks,
Chak.
If your goal is to migrate data between two very different schemas you will probably need an ETL solution (ETL=Extract Transform Load).
An ETL will allow you to:
Select data from your source database(s) [Extract]
Apply business logic to the selected data [Transform] (deal with duplicates, default values, map source tables/columns with destination tables/columns...)
Insert the data into the new database [Load]
Most ETLs also allow some kind of automatisation and reporting of the loads (bad/discarded rows...)
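As a rough illustration of the three steps, here is a minimal Python sketch that maps columns, fills defaults, drops duplicates and reports discarded rows. It works on in-memory dictionaries rather than a real database, and the column names, the mapping and the function names are all hypothetical.

```python
def extract(source_rows):
    """Extract: in a real job this would be a SELECT on the source database."""
    return list(source_rows)


def transform(rows, column_map, defaults):
    """Transform: rename columns, fill empty values, drop duplicate rows."""
    seen, clean, discarded = set(), [], []
    for row in rows:
        mapped = {}
        for old, new in column_map.items():
            value = row.get(old)
            mapped[new] = defaults.get(new) if value in (None, "") else value
        key = tuple(mapped[new] for new in column_map.values())
        if key in seen:
            discarded.append(row)  # keep duplicates for the load report
            continue
        seen.add(key)
        clean.append(mapped)
    return clean, discarded


def load(target, rows):
    """Load: in a real job this would be an INSERT into the new database."""
    target.extend(rows)
    return len(rows)


old_rows = [
    {"cust_nm": "Ann", "cty": "Oslo"},
    {"cust_nm": "Ann", "cty": "Oslo"},  # duplicate
    {"cust_nm": "Ben", "cty": ""},      # empty value
]
column_map = {"cust_nm": "customer_name", "cty": "city"}
defaults = {"city": "UNKNOWN"}

new_table = []
clean, discarded = transform(extract(old_rows), column_map, defaults)
loaded = load(new_table, clean)
print(loaded, len(discarded), new_table)
```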
Oracle's ETL is called Oracle Warehouse Builder (OWB). It is included in the Database licence and you can download it from the Oracle website. Like most Oracle products, it is powerful, but the learning curve is a bit steep.
You may want to look into the [ETL] section here in SO, among others:
What ETL tool do you use?
ETL tools… what do they do exactly? In laymans terms please.
In many cases, creating a database link and some scripts à la
insert into newtable select distinct foo, bar, 'defaultvalue' from oldtable@olddatabase where xxx
should do the trick