Synapse serverless pool to query delta table previous versions - azure-databricks

Can we use Synapse serverless pool (Built-in) to query a delta file's previous version?
I am looking for a SQL statement similar to what we do in Databricks:
select * from delta.`/my_dir` version as of 2
Does OPENROWSET support a "version selection" option?
If that's not possible, would registering the delta file as an external table help?

When I try to query the delta table in a serverless SQL pool in Synapse using the code below:
select * from delta.original version as of 0
I got an error. As per the documentation:
Serverless SQL pools don't support time travel queries.
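OPENROWSET can read a Delta folder, but only its current (latest) version; there is no version or timestamp option. A minimal sketch of a plain read, with a placeholder storage path:

-- Reads only the latest version of the Delta table; the URL below is a placeholder.
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/my_dir/',
        FORMAT = 'DELTA'
     ) AS rows;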
AFAIK, time travel on Delta Lake is not supported through SQL commands in the serverless pool. But you can use a Spark pool, loading the data into a dataframe with PySpark.
I used the code below to load the data into a dataframe:
df = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("<file location>")
I displayed the dataframe using the code below:
df.show()
You can try this approach to query previous versions of the delta table.

Related

Azure Sql Database - how can I tell what is blocking my IO

Is there a way to 'defrag' a table in Azure SQL, or to understand why a table scan is so slow? I'm using an Azure SQL Database to drive an SSAS Tabular Model. My source table has ~30M rows and is currently reading/processing only 1M rows/hour; with 15M rows it was processing 1M rows/minute. I'm using sp_whoisactive, but for the query that's driving the Tabular process I see IO blocking and no CPU/IO values. Other queries which don't need a table scan run OK.
Thanks for your help
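For what it's worth, the closest thing to 'defragging' a table is checking and rebuilding its indexes; a minimal sketch, assuming a hypothetical dbo.SourceTable as the source table:

-- Check fragmentation of all indexes on the source table (the table name is hypothetical).
SELECT i.name, ips.index_id, ips.avg_fragmentation_in_percent, ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.SourceTable'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id;

-- Rebuild the indexes if fragmentation is high (supported in Azure SQL Database).
ALTER INDEX ALL ON dbo.SourceTable REBUILD WITH (ONLINE = ON);

This doesn't address the blocking that sp_whoisactive shows, but it helps rule fragmentation in or out as a cause of the slow scan.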

Does the JDBC Kafka Sink connector support Oracle partitioning when inserting / updating data in an Oracle database?

I am asking because I will have a sink running in "upsert" mode, and the target Oracle table is partitioned. I wonder whether the update performance will be good, given the millions of records in the target table.
So long as you've set your table's partition key on the column that you're using for the incremental predicate (ID / timestamp), I don't see why Oracle wouldn't be able to take advantage of partition pruning to improve the fetch performance. But this is on the Oracle and data-model side, not something that's implemented by the connector.
The connector does not support anything like partition-exchange loading etc.
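As a hedged illustration of the point above (the table and column names here are assumptions, not anything defined by the connector), a target table interval-partitioned on the timestamp used for the incremental predicate might look like this:

-- Hypothetical target table; the sink upserts on event_id and filters on updated_at.
CREATE TABLE target_events (
    event_id    NUMBER         PRIMARY KEY,
    payload     VARCHAR2(4000),
    updated_at  TIMESTAMP      NOT NULL
)
PARTITION BY RANGE (updated_at)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
    PARTITION p_initial VALUES LESS THAN (TIMESTAMP '2020-01-01 00:00:00')
);

Queries and updates that filter on updated_at can then prune to the relevant daily partitions, independently of anything the connector does.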

How to convert CONNECT BY in greenplum

Can anyone suggest how to convert an Oracle CONNECT BY query to Greenplum? Greenplum doesn't support recursive queries, so we cannot use WITH RECURSIVE. Is there an alternative way to rewrite the query below?
SELECT child_id, Parnet_id, LEVEL , SYS_CONNECT_BY_PATH (child_id,'/') as HIERARCHY
FROM pathnode
START WITH Parnet_id = child_id
CONNECT BY NOCYCLE PRIOR child_id = Parnet_id;
There are ways to do this, but it will be a one-off per query. You will need to create a function that loops through your pathnode table and uses "return next" to return each row. You can search this site for examples of doing this with PostgreSQL 8.2.
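A minimal sketch of that function-based approach, assuming integer keys and the pathnode columns from the question; the path_row type, the expand_node name, and the simple self-loop guard are all hypothetical:

-- Hypothetical composite type matching the question's pathnode columns.
CREATE TYPE path_row AS (child_id int, parnet_id int, lvl int, hierarchy text);

-- Recursively walk the children of p_id, emitting one row per node with RETURN NEXT.
CREATE OR REPLACE FUNCTION expand_node(p_id int, p_lvl int, p_path text)
RETURNS SETOF path_row AS $$
DECLARE
    r path_row;
    c path_row;
BEGIN
    FOR r IN
        SELECT child_id, parnet_id, p_lvl, p_path || '/' || child_id::text
        FROM pathnode
        WHERE parnet_id = p_id
          AND child_id <> parnet_id   -- crude guard against the self-referencing root rows
    LOOP
        RETURN NEXT r;
        FOR c IN SELECT * FROM expand_node(r.child_id, p_lvl + 1, r.hierarchy) LOOP
            RETURN NEXT c;
        END LOOP;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

-- Example for a root whose child_id is 42 (roots are the rows where parnet_id = child_id):
-- SELECT 42, 42, 1, '/42' UNION ALL SELECT * FROM expand_node(42, 2, '/42');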
Work is happening to rebase Greenplum to PostgreSQL 8.3, 8.4, and so on. Those later PostgreSQL versions support "with recursive" which is the ANSI SQL way to write your SQL but Greenplum doesn't support it yet. When it does get supported by Greenplum, I don't think it will perform all that well. The query will force looping and individual row lookups. This works great in an OLTP database but not so well for an MPP database.
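For reference, the ANSI equivalent of the CONNECT BY query above looks roughly like the sketch below on a database that does support WITH RECURSIVE (integer ids are assumed, and the WHERE guard only approximates NOCYCLE for self-referencing rows):

WITH RECURSIVE hier (child_id, parnet_id, lvl, hierarchy) AS (
    -- START WITH Parnet_id = child_id
    SELECT child_id, parnet_id, 1, '/' || child_id::text
    FROM pathnode
    WHERE parnet_id = child_id
    UNION ALL
    -- CONNECT BY PRIOR child_id = Parnet_id
    SELECT p.child_id, p.parnet_id, h.lvl + 1, h.hierarchy || '/' || p.child_id::text
    FROM pathnode p
    JOIN hier h ON p.parnet_id = h.child_id
    WHERE p.child_id <> p.parnet_id   -- do not re-expand the self-referencing roots
)
SELECT child_id, parnet_id, lvl, hierarchy FROM hier;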
I suggest you transform your data in Oracle with a VIEW and then just dump the view to a file to load into Greenplum. A self-referencing, N-level table design will never be a good idea in an MPP database.

Microstrategy - HBase connection

We are trying to connect MS 9.4 to HBase via the Impala connector.
First we created the Hive tables, linking them to HBase tables with the following CREATE TABLE statement (as we saw in the docs):
CREATE TABLE hiveTableName1
(key int, codClient string, clientName string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,columnfamily1:columnName1,columnfamily1:columnName2")
TBLPROPERTIES ("hbase.table.name" = "hbaseTableName1");
We did this twice, since we want to create two Hive tables and their corresponding HBase tables, in order to perform a join between them later with MS.
For the connection between MS and HBase, we followed the steps, selecting the MicroStrategy ODBC Driver for Impala Wire Protocol and filling in the Data Source Name (the Impala Data Source previously created with the Impala Driver), the host and port (both for the Impala installation in our AWS infrastructure), and impala/impala for credentials.
The thing is that when we complete the wizard and select the default namespace (which is the only one available; no other namespace has been created), we can see the Hive tables that we created before, instead of the HBase tables.
I mean:
hiveTableName1
hiveTableName2
instead of
hbaseTableName1
hbaseTableName2
And since these are the only tables available, we can only build our report with these two tables: a very easy join between them on one field.
Both tables have 200,000 records and the join takes more than 1 minute to complete.
I'm sure that we are missing something here, and that the process of linking the Hive tables to the HBase ones is not completely right.
Is there a way to connect to these two HBase tables instead of the Hive ones?
Any help will be really appreciated.
1. HBase does not support SQL and does not support the concept of "join" anyway.
2. Mapping Hive tables on HBase tables means that every Hive query triggers a full scan on the HBase side, then the result is fed to a MapReduce batch job that does the filters and the joins.
Bottom line: 1 min is quite fast for what you are doing.
If you expect sub-second results, try some "small data" technologies (e.g. MySQL, Oracle, even MS Access) or forget about joins.
For sub-minute results, you might give Apache Phoenix a try: it's an HBase wrapper with indexes and some kind of SQL. Not sure about ODBC/JDBC drivers, though.
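If you do try Phoenix, a hedged sketch of mapping one of the existing HBase tables as a Phoenix view (the row-key column name and the VARCHAR types are assumptions; the column family and qualifiers follow the Hive mapping above):

-- Map the existing HBase table into Phoenix without copying any data.
CREATE VIEW "hbaseTableName1" (
    pk VARCHAR PRIMARY KEY,
    "columnfamily1"."columnName1" VARCHAR,
    "columnfamily1"."columnName2" VARCHAR
);

-- A join could then be written in Phoenix SQL, e.g.:
-- SELECT a."columnName1", b."columnName2"
-- FROM "hbaseTableName1" a JOIN "hbaseTableName2" b ON a.pk = b.pk;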

SSIS - Iterating with SQL Server Data in ForEachLoop to Dataflow with Oracle Backend and Inserting Results to SQL Server

Hey EXPERIENCED SSIS DEVELOPERS, I need your help.
High-Level Requirements
- Query a SQL Server table (on a different server than my SSIS server), resulting in a result set of about 200-300k records.
- Use three output columns from each row to look up data in an Oracle database.
- Insert or update a SQL Server table with the results.
- Use SSIS.
- SQL Server 2008
Sounds easy, right?
Here is what I have done:
1. Created an Execute SQL Task on the Control Flow that gets a recordset from SQL Server. Very fast, easy query, like select field1, field2, field3 from table where condition > 0. That's it. Takes less than a second.
2. Created a variable (evaluated as expression) for the Oracle query that uses the result set from #1 above in the WHERE clause.
3. Created a ForEachLoop Container that takes the results from #1 above and, for each row in the recordset, runs it through a Data Flow that uses the Oracle query from #2 above with Data access mode: SQL command from variable against an Oracle data source. Fast, simple query with only about 6 columns returned (each per-row query looks roughly like the sketch after this list).
4. Data Conversion - obvious reasons - changing 3 columns from Oracle data types to SQL Server data types.
5. OLE DB Destination to insert to SQL Server using Fast Load to a staging table.
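Each loop iteration issues a single-row query against Oracle, shaped roughly like the sketch below (the schema, table, and column names are purely illustrative assumptions):

-- One of the 200-300k per-row queries built from the SSIS variable; all names are hypothetical.
SELECT col1, col2, col3, col4, col5, col6
FROM ora_schema.lookup_table
WHERE key_field1 = 'value_from_field1'
  AND key_field2 = 'value_from_field2'
  AND key_field3 = 'value_from_field3';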
It works perfectly! Hooray! Bad news - it is very, very slow. When I say slow, I mean it processes 3,000 records per hour. Holy moly - so freaking slow.
Question: am I missing a way to speed it up? It seems like the ForEachLoop Container is the bottleneck. Growl.
Important Points:
- I have NO write access in Oracle environment, so don't even suggest a potential solution that requires it. Not a possibility. At all.
- Oracle sources do not allow for direct parameter definition. So no SELECT FIELD FROM TABLE WHERE ?. Don't suggest it - doesn't work.
Ideas
- Should I find a way to break down the results of the Execute SQL task and send them through several ForEachLoop Containers for faster processing?
- Is there another design that is more appropriate?
- Is there a script I can use that is faster?
- Would it be faster to create a temporary table in memory and populate it - then use the results to bulk insert to SQL Server? Does this work when using an Oracle data source?
ANY OTHER IDEAS?
