Reading Databricks tables in Azure - azure-databricks

Please clarify my confusion as I keep hearing we need read every Parquet file created by Databricks Delta tables to get to latest data in case of a SCD2 table. Is this true?
Can we simply use SQL and get the latest row?
Can we use some date/time columns to get to history of changes to that row?
Thanks

You can create and manage a Delta table as SCD2 if you wish. Actually, what confuses you must be the Time Travel feature. It just allow you to rollback your Delta table to any of its historical states which are automatically managed by Databricks.
Answer:
No. Instead, we need to look into all Delta lake transaction logs which are stored as JSON files inside _delta_log folder along with Parquet files and then see which parquet files are used as the latest data. See this doc for more information.
Yes. We can use SQL to get the latest row. Actually, any SQL script execution will always return latest data. However, there are 2 ways to get historical version of the table by SQL which are 1) specifying "as of version" and 2) specifying "as of timestamp". Read more.
PS This is a really good question.

Related

AWS DMS with CDC. The update records only include the updated field. How to include all?

We recently started the process of continuous migration (initial load + CDC) from an Oracle database on RDS to S3 using AWS DMS. The DB is using LogMiner.
the problem that we have detected is that the CDC records of type Update only contain the data that was updated, leaving the rest of the fields empty, so the possibility of simply taking as valid the record with the maximum timestamp value is lost.
Does anyone know if this can be changed or in what part of the DMS or RDS configuration to touch so that the update contains the information of all the fields of the record?
Thanks in advance.
Supplemental Logging at table level may increase what is logged, but that will also increase total volume of log data written for a given workload.
Many Log Based Data Replication products from various vendors require additional supplemental logging at the table level to ensure the full row data for updates with before and after change data is written to the database logs.
re: https://docs.oracle.com/database/121/SUTIL/GUID-D857AF96-AC24-4CA1-B620-8EA3DF30D72E.htm#SUTIL1582
Pulling data through LogMiner may be possible, but you will need to evaluate if it will scale with the data volumes you need.
DMS-FULL/CDC also supports Binary Reader better option to LogMiner. In order to capture updates WITH all the columns use "ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS" on Oracle side.
This will push all the columns in a update record to endpoint from Oracle RAC/non-RAC dbs. Also, a pointer for CDC is use TRANSACT_ID in DMS side to generate a unique sequence for each record. Redo will be little more but, it is what it is; you can keep an eye on it and DROP the supplemental logging if require at the table level.
Cheers!

Sqoop incremental load using Informatica BDM

I am new to Informatica BDM.I have a use case in which I have to import the data incrementally (100 tables) from RDBMS into Hive on daily basis. Can someone please guide me with the best possible approach to achieve this?
Thanks,
Sumit
Hadoop is write onces read many (WORM) approach and the incremental load is not easy stuff. There are following guideline you can follow and validate your current requirement
If the table is a small/mid-size and not having too many records,
better to refresh the entire table
If the table is too big and incremental load has add/update/delete operation, you can think of staging the delta and perform a join operation to re-create data set.
For large table and large delta, you can create a version number for all the latest record and each delta may come to a new directory and a view should be created to get the latest version for further processing. This avoid heavy merge operation.
If the delete operation is not coming as change, then you also need to think how to act on it and in such case, you need to get the full refresh.

Rewind a database table in Oracle and return to the current state in real time

We are working on a new release of our product and we want to implement a feature where a user can view older data for a disease prediction made by the application. For example, the user would have an option to go back in time to see the predictions made one year ago. At the database level what needs to happen is to fetch archived data.The number of tables in the database is around 200 and out of them the tables that need to go back to an older state
I read about Flashbacks and although they seem to be used more for recovery, was curious to know if they can be used.
1> Would it be possible to use Flashbacks?
2> If yes, how would it affect performance?
3> If no what could be some other options?
Thank you
Flashback could be a way, but you need to use flashback data archive for the table you want. Using this technology, you can choose how much time you want to be able to store. What i find interesting using flashback technology, is that you query the same table (with some aditional options), instead of the other option, of creating a history table.

IBM Data Stage - How to find database tables used in jobs

For a project we need to investigate an existing installation of IBM Data Stage, doing a whole lot of ETL in loads of jobs.
The job flow diagrams contain lots of tables being used a source (both in MSSQL as well as Oracle), as well as a target (mostly in Oracle).
My question is now
How can I find all database tables used by all jobs in a certain Data Stage Project ?
I looked in Tools - Advanced Find, and there I can see all "table definitions". BUT, most of the tables actually used in jobs do not show up there, as they are defined as what Data Stage calls "Parallel Jobs" which in effect are SQL queries against database tables.
I am particularly interested in locating TARGET tables which are being loaded by a job.
So to put it bluntly, I want to be able to answer the question "Which job loads table XY ?".
If that is not possible, an automated means of extracting all the SQL statements used by the jobs would be an alternative.
We have access to IBM Websphere Data Stage and Quality Stage Designer 8.1
Exporting the jobs creates a text file that details what the job does. Open the export file in a text editor and you should be able to find SQL inserts with a simple search. Start with searching for SQL keywords like 'INTO' and 'FROM'.
Edit: Alternatively, if every table that was used was defined by importing table definitions, you should be able to find the table definition in the folder for its type. This however, will not make it apparent where and how the table was used (which job, insert or select from?), so I would recommend the first method of searching the Export files.

MERGE in Vertica

I would like to write a MERGE statement in Vertica database.
I know it can't be used directly, and insert/update has to be
combined to get the desired effect.
The merge sentence looks like this:
MERGE INTO table c USING (select b.field1,field2 aeg from table a, table b
where a.field3='Y'
and a.field4=b.field4
group by b.field1) t
on (c.field1=t.field1)
WHEN MATCHED THEN
UPDATE
set c.UUS_NAIT=t.field2;
Would just like to see an example of MERGE being used as insert/update.
You really don't want to do an update in Vertica. Inserting is fine. Selects are fine. But I would highly recommend staying away from anything that updates or deletes.
The system is optimized for reading large amounts of data and for inserting large amounts of data. So since you want to do an operation that does 1 of the 2 I would advise against it.
As you stated, you can break apart the statement into an insert and an update.
What I would recommend, not knowing the details of what you want to do so this is subject to change:
1) Insert data from an outside source into a staging table.
2) Perform and INSERT-SELECT from that table into the table you desire using the criteria you are thinking about. Either using a join or in two statements with subqueries to the table you want to test against.
3) Truncate the staging table.
It seems convoluted I guess, but you really don't want to do UPDATE's. And if you think that is a hassle, please remember that what causes the hassle is what gives you your gains on SELECT statements.
If you want an example of a MERGE statement follow the link. That is the link to the Vertica documentation. Remember to follow the instructions clearly. You cannot write a Merge with WHEN NOT MATCHED followed and WHEN MATCHED. It has to follow the sequence as given in the usage description in the documentation (which is the other way round). But you can choose to omit one completely.
I'm not sure, if you are aware of the fact that in Vertica, data which is updated or deleted is not really removed from the table, but just marked as 'deleted'. This sort of data can be manually removed by running: SELECT PURGE_TABLE('schemaName.tableName');
You might need super user permissions to do that on that schema.
More about this can be read here: Vertica Documentation; Purge Data.
An example of this from Vertica's Website: Update and Insert Simultaneously using MERGE
I agree that Merge is supported in Vertica version 6.0. But if Vertica's AHM or epoch management settings are set to save a lot of history (deleted) data, it will slow down your updates. The update speeds might go from what is bad, to worse, to horrible.
What I generally do to get rid of deleted (old) data is run the purge on the table after updating the table. This has helped maintain the speed of the updates.
Merge is useful where you definitely need to run updates. Especially incremental daily updates which might update millions of rows.
Getting to your answer: I don't think Vertica supportes Subquery in Merge. You would get the following.
ERROR 0: Subquery in MERGE is not supported
When I had a similar use-case, I created a view using the sub-query and merged into the destination table using the newly created view as my source table. That should let you keep using MERGE operations in Vertica and regular PURGEs should let you keep your updates fast.
In fact merge also helps avoid duplicate entries during inserts or updates if you use the correct combination of fields in ON clause, which should ideally be a join on the primary keys.
I like geoff's answer in general. It seems counterintuitive, but you'll have better results creating a new table with the rows you want in it versus modifying an existing one.
That said, doing so would only be worth it once the table gets past a certain size, or past a certain number of UPDATEs. If you're talking about a table <1mil rows, I might chance it and do the updates in place, and then purge to get rid of tombstoned rows.
To be clear, Vertica is not well suited for single row updates but large bulk updates are much less of an issue. I would not recommend re-creating the entire table, I would look into strategies around recreating partitions or bulk updates from staging tables.

Resources