I have an architectural requirement to store the data in ADLS under a medallion model, and I am trying to write to ADLS using Delta Live Tables as a precursor to creating the Delta table.
I've had success using CREATE TABLE {dlt_tbl_name} USING DELTA LOCATION {location_in_ADLS} to create the Delta table without Delta Live Tables... however, the goal is to use Delta Live Tables, and I don't see how this method is supported there.
Anyone have a suggestion? I'm guessing at this point that writing to ADLS isn't supported.
If you look into the documentation, you can see that you can specify the path parameter for @dlt.table (for Python). Similarly, you can specify the LOCATION parameter when using SQL (docs). You just need to make sure that you've provided all the necessary Spark configuration parameters at the pipeline level (with a service principal or SAS token). The following code works just fine:
In Python:
import dlt

# Source view that feeds the table below
@dlt.view
def input():
    return spark.range(10)

# Materialize the table at an explicit ADLS path
@dlt.table(
    path="abfss://test@<account>.dfs.core.windows.net/dlt/python"
)
def python():
    return dlt.read("input")
In SQL:
CREATE OR REFRESH LIVE TABLE sql
LOCATION 'abfss://test@<account>.dfs.core.windows.net/dlt/sql'
AS SELECT * FROM LIVE.input
I have an autoloader table processing a mount point with CSV files.
After each run, I would like to insert some of the records into another table where I have an AutoIncrement Identity column set up.
I can rerun the entire insert and this works, but I am trying to only insert the newest records.
I have CDF enabled, so I should be able to determine the latest version, or maintain the versions already processed. But it seems like I am missing some built-in feature of Databricks.
Any suggestions or sample to look at?
Note - Delta change data feed is available in Databricks Runtime 8.4 and above.
You can read the change events in batch queries using SQL and DataFrame APIs (that is, df.read), and in streaming queries using DataFrame APIs (that is, df.readStream).
Enable CDF
%sql
ALTER TABLE silverTable SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
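As a rough sketch of then reading only the newer changes in SQL (the starting version 2 is a placeholder; you would track the last version you processed yourself, e.g. in a small bookkeeping table):
%sql
-- Only the change events produced after the last processed version
SELECT *
FROM table_changes('silverTable', 2)
WHERE _change_type IN ('insert', 'update_postimage')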
Any suggestions or sample to look at?
You can find Sample Notebook here
I have a requirement to fetch data from Oracle and upload it into Google Cloud Storage.
I am using the ExecuteSQL processor, but it is failing for large tables, and even for a table with 1 million records of approximately 45 MB it takes 2 hours to pull the data.
The table names are passed via a REST API to ListenHTTP, which passes them to ExecuteSQL. I can't use QueryDatabaseTable because the number of tables is dynamic, and the calls to start the fetch are also triggered dynamically from a UI via the NiFi REST API.
Please suggest any tuning parameters for the ExecuteSQL processor.
I believe you are talking about having the capability to have smaller flow files and possibly sending them downstream while the processor is still working on the (large) result set. For QueryDatabaseTable this was added in NiFi 1.6.0 (via NIFI-4836) and in an upcoming release (NiFi 1.8.0 via NIFI-1251) this capability will be available for ExecuteSQL as well.
You should be able to use GenerateTableFetch to do what you want. There you can set the Partition Size (which will end up being the number of rows per flow file) and you don't need a Maximum Value Column if you want to fetch the entire table each time a flow file comes in (which also allows you do handle multiple tables as you described). GenerateTableFetch will generate the SQL statements to fetch "pages" of data from the table, which should give you better, incremental performance on very large tables.
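For illustration only, the generated statements are roughly of this shape (my_table and id are hypothetical names, and the exact paging syntax depends on the Database Type/adapter configured on the processor):
-- One generated statement per "page" of Partition Size rows
SELECT * FROM my_table
ORDER BY id
OFFSET 10000 ROWS FETCH NEXT 10000 ROWS ONLY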
I am trying to design a data pipeline to migrate my Hive tables into BigQuery. Hive is running on an on-premise Hadoop cluster. This is my current design; it is actually very simple, just a shell script:
for each table source_hive_table {
    INSERT OVERWRITE TABLE target_avro_hive_table SELECT * FROM source_hive_table;
    Move the resulting Avro files into Google Cloud Storage using distcp
    Create the first BQ table: bq load --source_format=AVRO your_dataset.something something.avro
    Handle any casting issues from BigQuery itself, i.e. select from the table just written and handle any casts manually
}
Do you think it makes sense? Is there any better way, perhaps using Spark?
I am not happy about the way I am handling the casting; I would like to avoid creating the BigQuery table twice.
Yes, your migration logic makes sense.
I personally prefer to do the CAST for specific types directly in the initial Hive query that generates your Avro (Hive) data. For instance, the "decimal" type in Hive maps to the Avro type: "type":"bytes","logicalType":"decimal","precision":10,"scale":2
And BQ will just take the primary type (here "bytes") instead of the logicalType.
So that is why I find it easier to cast directly in Hive (here to "double").
The same problem happens with the Hive date type.
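For example, a minimal sketch of doing that cast in the Hive query that produces the Avro data (column names are made up; the table names reuse the ones from your script):
-- Cast decimal and date columns up front so the Avro files carry
-- plain primitive types that BigQuery loads without a second pass
INSERT OVERWRITE TABLE target_avro_hive_table
SELECT
  id,
  CAST(amount AS DOUBLE)     AS amount,      -- was DECIMAL(10,2)
  CAST(event_date AS STRING) AS event_date   -- was DATE
FROM source_hive_table;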
What is the best way to ingest data from a Teradata database into Hadoop with parallel data movement?
If we create a job which simply opens one session to the Teradata database, it will take a long time to load a huge table.
If we create a set of sessions to load data in parallel, and also run a SELECT in each of the sessions, then Teradata will perform a set of full table scans to produce the data.
What is the recommended best practice to load data in parallel streams without putting unnecessary workload on Teradata?
If Teradata supports table partitioning like Oracle, you could try reading the table based on partitioning points, which will enable parallelism in the read...
Another option you have is to split the table into multiple partitions, e.g. by adding a WHERE clause on an indexed column. This will ensure an index scan and you can avoid a full table scan.
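A rough sketch of that kind of split, assuming an indexed numeric id column (table name, column, and boundaries are hypothetical):
-- Each parallel session runs one slice; together the slices cover the table
SELECT * FROM big_table WHERE id >= 1       AND id < 5000000;
SELECT * FROM big_table WHERE id >= 5000000 AND id < 10000000;
-- ...and so on for the remaining ranges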
The most scalable way I have found to ingest data into Hadoop from Teradata is to use the Teradata Connector for Hadoop. It is included in the Cloudera and Hortonworks distributions. I will show an example based on the Cloudera documentation, but the same works with Hortonworks as well:
Informatica Big Data Edition uses a standard Sqoop invocation via the command line, submitting a set of parameters to it. So the main question is which driver to use to make parallel connections between the two MPP systems.
Here is the link to the Cloudera documentation:
Using the Cloudera Connector Powered by Teradata
And here is the digest from this documentation (you can see that this connector supports different kinds of load balancing between connections):
Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:
split.by.amp
split.by.value
split.by.partition
split.by.hash
split.by.amp Method
This is the optimal method for retrieving data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper subsequently retrieves data from its AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.
If you use partition names in the SELECT clause, PowerCenter will select only the rows within that partition, so there won't be a duplicate read (don't forget to choose Database partitioning at the Informatica session level). However, if you use key range partitioning, you have to choose the ranges in the settings, as you mentioned. Usually we use the NTILE Oracle analytical function to split the table into multiple portions so that the read is unique across the SELECTs. If you have a range/auto-generated/surrogate key column in the table, use it in the WHERE clause: write a sub-query to divide the table into multiple portions. Please let me know if you have any questions.
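As a hedged illustration of the NTILE approach (table and column names are hypothetical):
-- Assign each row to one of 4 buckets; each parallel reader then
-- selects exactly one bucket, so no row is read twice
SELECT *
FROM (
  SELECT t.*, NTILE(4) OVER (ORDER BY surrogate_key) AS bucket
  FROM   big_table t
)
WHERE bucket = 1;  -- reader 1 of 4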
I was wondering, for folks familiar with DataStage, if Oracle SQLLDR can be used in DataStage. I have some sets of control files that I would like to incorporate into DataStage. A step-by-step way of accomplishing this would be greatly appreciated. Thanks.
My guess is that you can run it with an external stage in DataStage.
You simply put the SQLLDR command in the external stage and it will be executed.
Try it and tell me what happens.
We can use Oracle SQL*Loader in DataStage.
If you check the Oracle docs, there are two types of loading under SQL*Loader:
1) Direct path load - less validation on the database side
2) Conventional path load
There is less validation in a direct path load compared to a conventional path load.
In the SQL*Loader process we have to specify options such as:
Direct or not
Parallel or not
Constraint and Index options
Control, discard, and log files
In DataStage, we have the Oracle Enterprise and Oracle Connector stages.
Oracle Enterprise -
We have a Load option in this stage to load data in fast mode, and we can set the environment variable OPTIONS for Oracle; an example is below:
OPTIONS(DIRECT=FALSE,PARALLEL=TRUE)
Oracle Connector -
We have a Bulk load option for it, and other properties related to SQL*Loader are available in the Properties tab.
Example - control and discard file values are all set by DataStage, but you can set these properties and others manually.
As you know, SQLLDR basically loads data from files into the database, so DataStage allows you to use any input data file. You take the input from a data file with something like a Sequential File stage, pass it the format and the schema of the table, and it will create an in-memory template table; then you can use a database connector like ODBC or DB2, and that will load the data into your table, simple as that.
NOTE: if your table does not already exist on the backend, then for the first execution set the stage to create the table, and afterwards set it to append or truncate.
Steps:
Read the data from the file (Sequential File stage).
Load it using the Oracle Connector (you could use Bulk load so that you use the direct load method via SQL*Loader, and the data file and control file settings can be configured manually). Bulk load operation: it receives records from the input link and passes them to the Oracle database, which formats them into blocks and appends the blocks to the target table, as opposed to storing them in the available free space of existing blocks.
You could refer to the IBM documentation for more details.
Remember, there might be some restrictions on loading when it comes to handling rejects, triggers, or constraints when you use bulk load. It all depends on your requirements.