How to do real time data ingestion from Transactional tables to a Flat table - oracle

We have transaction tables in Oracle and for reporting purposes we need this data transfered in real time to another flat Oracle table in another database. The performance of the report is great with table placed in this flat table.
Currently we are using golden gate for replication to the other database and using materialized view for this but due to some problems we need to switch to some other way of populating/maintaining this flat table. What options do we have?
It is a pretty basic requirement but the solutions I can see are for batch processing. Also if there are any other solutions you feel would better serve this purpose. Changing the target database to something other is also an option as there might be more such reports coming ahead.

Related

oracle schema sharing , is it possible?

Trying to understand if there is any such concept like this in Oracle Database.
Let's say I have two Databases, Database_A & Database_B
Database_A has schema_A, is there a way I can attach this schema to Database_B?
What I mean by this is if there is a job populating a TABLE_A in schema_A, I can see that read-only view in Database_B. We are trying to split a big Oracle database into two smaller databases and have a vast PL/SQL code, and trying to minimize the refactoring here.
Sharding might be what you're looking for. The schemas and tables will still logically exist on all databases, but you can arrange the data to be physically stored in specific databases. There might be a way to setup shardspaces, tablespaces, and user default tablespaces in a way where each schema's data is automatically stored in a specific database.
But I haven't actually used sharding. From what I've read, it seems to be designed for massive distributed OLTP systems, and it is likely complicated to administer. I'd guess this feature isn't worth the hassle unless you have petabytes of data.

Hive Managed vs External tables maintainability

Which one is better (performance wise and operation on the long run) in maintaining data loaded, managed or external?
And by maintaining, i mean that these tables will have the following operations on daily basis frequently;
Select using partitions most of the time.. but for some of it they are not used.
Delete specific records, not all the partition (for example found a problem in some columns and want to delete and insert it again). - i am not sure if this supported for normal tables, unless transactional is used.
Most important, The need to merge files frequently.. may be twice a day to merge small files to gain less mappers. I know concate is available on managed and insert overwrite on external.. which one is less cost?
It depends on your use case. External table is recommended when they are used across multiple application for example Along with hive pig or other application is also used for processing the data in this kind of scenario external tables are mainly recommended.They are used when you are mainly reading data.
While in case of managed tables hive have complete control over the data. Though you can convert any external table to managed and vice versa
alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
As in your case you are doing frequent modifications in data so it is better that hive should have total control over the data. In this scenraio it is recommended to use Managed tables.
Apart from that managed table are more secure then external table because external table can be accessed by anyone. While in managed table you can implement hive level security which provided better control but in case of external you will have to implement HDFS level security.
You can refer the below links which can give you few pointers in considerations
External Vs Managed tables comparison

Seeking Opinion:Denormalising Fact and Dim tables to improve performance of SSRS Reports

We seem to have bit of a debate on a discussion point in our team.
We are working on a Data Warehouse in the Microsoft SQL Server 2012 platform. We have followed the Kimball Architecture to build this Data Warehouse.
Issue:
A reporting solution (built on SSRS), which sources data from this Warehouse, has significant performance issues when sourcing data from fact and dim tables. Some of our team members suggest that we extract data from facts and dims into a new set of tables using SSIS packages. This would mean we denormalise these tables into ‘Snapshot’ tables. In this way the we would not need to join these tables to create data sets within the reports. Data could be read out of these tables directly.
I do have my own worries about this; inconsistencies, maintenance of different data structures, duplication of data etc to name a few.
Question:
Would you consider creating snapshot tables (by denormalising facts and dim tables) for reporting tables a right approach?
Would like to hear your thoughts on this.
Cheers
Nithin
I don't think there is anything wrong with snapshot tables. The two most important aspects of a data warehouse are:
The data is correct.
The data is useful.
If your users are unable to extract the totals they require, in a reasonable timescale, they won't use the warehouse.
My own solution includes 3 snapshot tables. Like you, I was worried about inconsistencies. To address this we built an automated checking process. This sub-system executes a series of queries, stored on a network drive, once an hour. Any records returned by the queries are considered a fail. Fails are reported and immediately investigated by my ETL team. This sub-system ensures the snapshots and underlying facts are always aligned and consistent with each other. Drift is prevented.
That said, additional tables equals additional complexity. And that requires more time/effort to manage. Before introducing another layer to your warehouse, you should investigate why these queries are underperforming. If joins are to blame:
Are you using an inappropriate data type, for your P/F keys?
Are the FKeys indexed (some RDBMS do this by default, others do not)?
Have you looked at the execution plans, for the offending queries?
Is the join really to blame, or is it a filter applied to the dim table?
for raw cube performance my advice would be to always try to denormalize your tables and have one fact table and one table for each dimension (star schema).
If you are unsure if it will actually help you could start creating materialized views. These are kind of the best of both worlds, on the long run you should alter your etl.
In my previous job we only had flattened tables which worked quite well. Currenly we have a normalized schema but flatten it in the last step.

"Saving" BigQuery Views for use in Tableau

I'm trying to make faster dashboards in Tableau by creating views of my calculations directly in BigQuery.
Based on my understating if the gcloud documentation here, the view will re-execute the query once it is accessed, so it kinda defeats my goal.*
*My goal is to eliminate calculations on the fly, be it in Tableau or BigQuery.
Is it possible to "save" these views, by way of scheduled scripts or workflows?
Thanks,
A view is best thought of as a way to reformat a table to make it look more convenient to further queries. The query still has to run on BigQuery so the benefits will be that the view may look simpler to Tableau than the raw table (particularly convenient if the view uses some complex SQL to create some of its columns). But it won't save calculation time.
But, if your view is doing some complex consolidation of a larger table then it might be worth saving the results as a new table instead of creating a view. This is OK if your underlying table doesn't change frequently (rule of thumb if you use the results every day and the table changes weekly, it is probably worthwhile and certainly so if the changes are monthly). Then Tableau will be querying pre-consolidated results rather than the much larger raw table. BigQuery storage and processing is cheap so this is often a reasonable solution.
Another alternative is to use a Tableau extract to bring the data into your local drive or server. This is only practical if the table is small enough to fit locally and will only work really well for speed if it fits into local memory (which can be a lot more than you might think). But extracts, at least on Tableau server, can be set to refresh on a schedule, making much faster user interaction and absolving you of having to remember to manually update the consolidated table.

How to implement an ETL Process

I would like to implement a synchronization between a source SQL base database and a target TripleStore.
However for matter of simplicity let say simply 2 databases. I wonder what approaches to use to have every change in the source database replicated in the target database. More specifically, I would like that each time some row changes in the source database that this can be seen by a process that will read the changes and populate the target database accordingly while applying some transformation in the middle.
I have seen suggestion around the mechanism of notification that can
be available in the database, or building tables such that changes can
be tracked (meaning doing it manually) and have the process polling it
at different intervals, or the usage of Logs (change data capture,
etc...)
I'm seriously puzzle about all of this. I wonder if anyone could give some guidance and explanation about the different approaches with respect to my objective. Meaning: name of methods and where to look.
My organization mostly uses: Postgres and Oracle database.
I have to take relational data and transform them in RDF so as to store them in a triplestore and keep that triplestore constantly synchronized with the data is the SQL Store.
Please,
Many thanks
PS:
A clarification between ETL and replication techniques as in Change Data capture, with respect to my overall objective would be appreciated.
Again i need to make sense of the subject, know what are the methods, so i can further start digging for myself. So far i have understood that CDC is the new way to go.
Assuming you can't use replication and you need to use some kind of ETL process to actually extract, transform and load all changes to the destination database, you could use insert, update and delete triggers to fill a (manually created) audit table. Columns GeneratedId, TableName, RowId, Action (insert, update, delete) and a boolean value to determine if your ETL process has already processed this change. Use that table to get all the changed rows in your database and transport them to the destination database. Then delete the processed rows from the audit table so that it doesn't grow too big. How often you have to run the ETL process depends on the amount of changes occurring in the source database.

Resources