Teradata TPT Load Script Performance

I hope someone can help me improve a Teradata TPT load script. I am using the script below to load a 3 GB delimited CSV file into Teradata; the file resides on my local laptop hard drive. The load takes approximately 30 minutes, which is quite long, and the total number of rows to be loaded is approximately 30 million. Any recommendations on performance improvement?
DEFINE JOB LOAD_TD_FROM_CSV
DESCRIPTION 'Load Teradata table from CSV File'
(
DEFINE SCHEMA FOOD_TPT /*Define Table schema*/
DESCRIPTION 'FOOD_TPT'
(
col1 VARCHAR(20),
col2 VARCHAR(100),
col3 VARCHAR(100),
col4 VARCHAR(100),
col5 VARCHAR(100),
col6 VARCHAR(100),
col7 VARCHAR(100),
col8 VARCHAR(100)
);
DEFINE OPERATOR DDL_OPERATOR
TYPE DDL
ATTRIBUTES
(
VARCHAR TdpId = 'system', /*System Name*/
VARCHAR UserName = 'user', /*USERNAME*/
VARCHAR UserPassword = 'password', /*Password*/
VARCHAR Errorlist ='3807' /*This is added to skip the 'Table does not exist' error and treat it as a warning*/
);
DEFINE OPERATOR LOAD_CSV /*Load information*/
DESCRIPTION 'Operator to Load CSV Data'
TYPE LOAD
SCHEMA *
ATTRIBUTES
(
VARCHAR PrivateLogName,
VARCHAR TraceLevel = 'None',
INTEGER TenacityHours = 1,
INTEGER TenacitySleep = 1,
INTEGER MaxSessions = 4,
INTEGER MinSessions = 1,
INTEGER BUFFERSIZE =16,
VARCHAR TargetTable = 'FOOD_TPT_STG', /*Define target table name where the file will be loaded*/
VARCHAR LogTable = 'FOOD_TPT_LOG', /*Define Log table name*/
VARCHAR ErrorTable1 = 'FOOD_TPT_STG_E1',/*There are 2 error tables. Define them. First table is _ET table*/
VARCHAR ErrorTable2 = 'FOOD_TPT_STG_E2', /*Define _UV table*/
VARCHAR TdpId = 'system', /*System Name*/
VARCHAR UserName = 'user', /*Username*/
VARCHAR UserPassword = 'password' /*Password*/
);
DEFINE OPERATOR READ_CSV
DESCRIPTION 'Operator to Read CSV File'
TYPE DATACONNECTOR PRODUCER
SCHEMA FOOD_TPT
ATTRIBUTES
(
VARCHAR Filename = 'file.csv' /*give file name with path*/
,VARCHAR Format = 'Delimited'
,VARCHAR TextDelimiter = ','
,VARCHAR AcceptExcessColumns = 'N'
,VARCHAR PrivateLogName = 'LOAD_FROM_CSV'
,INTEGER SkipRows = 1 /*skips the header row in the CSV file*/
);
Step Setup_Tables /*Enter all executable SQLs in this step*/
(
APPLY
('Drop table FOOD_TPT_STG_E1;'), /*Drop error tables*/
('Drop table FOOD_TPT_STG_E2;'),
('Drop table FOOD_TPT_LOG;'), /*Drop Log Table*/
('Drop table FOOD_TPT_STG;'), /*Drop Target staging tables*/
('CREATE TABLE FOOD_TPT_STG ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
datablocksize= 1022 kbytes,
DEFAULT MERGEBLOCKRATIO
(
col1 VARCHAR(20),
col2 VARCHAR(100),
col3 VARCHAR(100),
col4 VARCHAR(100),
col5 VARCHAR(100),
col6 VARCHAR(100),
col7 VARCHAR(100),
col8 VARCHAR(100)
)
NO PRIMARY INDEX;') /*Create Target table*/
TO OPERATOR (DDL_OPERATOR);
);
Step Load_Table
(
APPLY ('INSERT INTO FOOD_TPT_STG /*table name must match the LOAD operator's TargetTable*/
(
:col1
,:col2
,:col3
,:col4
,:col5
,:col6
,:col7
,:col8
);') /*Inserts records from CSV file into Target Table*/
TO OPERATOR (LOAD_CSV)
SELECT * FROM operator(READ_CSV);
);
);
Thanks in advance

As Fred wrote, you specify a BUFFERSIZE of 16 KB. With this schema a row can be up to roughly 740 bytes, so that is only about 22 rows per block (FastLoad calculates this based on the defined maximum row size), i.e. roughly 1.4 million messages sent for 30 million rows. Remove the attribute and you get the default of 1 MB, about 1,400 rows per block. Additionally, you might simplify your script like this:
DEFINE JOB LOAD_TD_FROM_CSV
DESCRIPTION 'Load Teradata table from CSV File'
(
Step Setup_Tables /*Enter all executable SQLs in this step*/
(
APPLY
('Drop table FOOD_TPT_STG_E1;'), /*Drop error tables*/
('Drop table FOOD_TPT_STG_E2;'),
('Drop table FOOD_TPT_LOG;'), /*Drop Log Table*/
('Drop table FOOD_TPT_STG;'), /*Drop Target staging tables*/
('CREATE TABLE FOOD_TPT_STG ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
datablocksize= 1022 kbytes,
DEFAULT MERGEBLOCKRATIO
(
col1 VARCHAR(20),
col2 VARCHAR(100),
col3 VARCHAR(100),
col4 VARCHAR(100),
col5 VARCHAR(100),
col6 VARCHAR(100),
col7 VARCHAR(100),
col8 VARCHAR(100)
)
NO PRIMARY INDEX;') /*Create Target table*/
TO OPERATOR ($DDL ATTR
(
TdpId = 'system', /*System Name*/
UserName = 'user', /*USERNAME*/
UserPassword = 'password', /*Password*/
Errorlist ='3807' /*This is added to skip the 'Table does not exist' error and treat it as a warning*/
)
);
);
Step Load_Table
(
APPLY ($INSERT 'FOOD_TPT_STG') /*Inserts records from the CSV file into the target table; the name must match TargetTable below*/
TO OPERATOR ($LOAD ATTR (
/* BUFFERSIZE = 16384: the default is 1 MB; increasing it further toward the 16 MB maximum might improve things a bit */
TargetTable = 'FOOD_TPT_STG', /*Define target table name where the file will be loaded*/
LogTable = 'FOOD_TPT_LOG', /*Define Log table name*/
ErrorTable1 = 'FOOD_TPT_STG_E1',/*There are 2 error tables. Define them. First table is _ET table*/
ErrorTable2 = 'FOOD_TPT_STG_E2', /*Define _UV table*/
TdpId = 'system', /*System Name*/
UserName = 'user', /*Username*/
UserPassword = 'password' /*Password*/
)
)
SELECT * FROM operator($FILE_READER ATTR
(
Filename = 'file.csv' /*give file name with path*/
,Format = 'Delimited'
,TextDelimiter = ','
,AcceptExcessColumns = 'N'
,PrivateLogName = 'LOAD_FROM_CSV'
,SkipRows=1 /*skips the header in csv file*/
));
);
);
And there are job variables files to make scripts more reusable.
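A minimal sketch of such a job variables file (the variable names here are illustrative assumptions, not necessarily the exact names your templates expect; check them against the template files shipped with your TPT installation):
/* jobvars.txt */
TargetTdpId = 'system'
,TargetUserName = 'user'
,TargetUserPassword = 'password'
,TargetTable = 'FOOD_TPT_STG'
,SourceFileName = 'file.csv'
,SourceTextDelimiter = ','
Inside the script you would then reference them as @TargetTdpId, @TargetTable, and so on, and launch the job with:
tbuild -f load_from_csv.tpt -v jobvars.txt -j food_tpt_load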

Related

AWS Randomize data for large tables

I have a table in Redshift with over 1.8 billion rows and I am trying to randomize that data.
Here are the table's attributes:
id bigint,
customer_internal_id bigint,
customer_id VARCHAR(256) Not NULL,
customer_name VARCHAR(256) Not NULL,
customer_type_id bigint,
start_date date,
end_date date,
request_id bigint,
entered VARCHAR(256) not NULL,
superseded VARCHAR(256) not NULL,
customer_latitude double precision,
customer_longitude double precision,
zip_internal_id bigint
How can I achieve this? I tried to look for options, but there is not enough documentation available.
Here is the expected output.
I have some code written for PostgreSQL:
with result as (
select id, customer_id, customer_name,
lead(customer_id) over w as first_1,
lag(customer_name) over w as first_2
from master.customer_temp_df
window w as (order by random())
)
update master.customer_temp_df
set customer_id = coalesce(first_1, first_2),customer_name = coalesce(first_2, first_1)
from result
where master.customer_temp_df.id = result.id;
but this doesn't work in Redshift, and I am looking for something similar.
The final goal is to randomize the entire table.
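One possible Redshift adaptation, as an untested sketch: Redshift does not support the named WINDOW clause used above, so materialize the random sort key once in a subquery, inline the OVER clauses, and use Redshift's UPDATE ... FROM form:
update master.customer_temp_df
set customer_id = coalesce(r.first_1, r.first_2),
    customer_name = coalesce(r.first_2, r.first_1)
from (
    select id,
           lead(customer_id) over (order by rnd) as first_1,
           lag(customer_name) over (order by rnd) as first_2
    from (select id, customer_id, customer_name, random() as rnd
          from master.customer_temp_df) t
) r
where master.customer_temp_df.id = r.id;
For 1.8 billion rows an UPDATE of this shape rewrites most of the table, so a CREATE TABLE AS into a new table followed by a rename may be cheaper.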

How to update columns with different values when there are duplicate identifier values

I have a table with definition:
CREATE TABLE test(
id NUMBER(19,0),
nam VARCHAR2(50) NOT NULL,
email VARCHAR2(50) NOT NULL
);
and the data
I have to set the same ID for the entries that have the same EMAIL.
How can I do it?
I am using the Oracle 18c database.
expected results
If you just want to have the same id for the matching emails:
MERGE INTO test tt
USING (SELECT MIN(id) AS id
            , email
       FROM test
       GROUP BY email) mails
ON (tt.email = mails.email)
WHEN MATCHED THEN UPDATE SET tt.id = mails.id;

How to optional update data on oracle?

I have this table:
CREATE TABLE "ALMAT"."PRODUCT"
( "ID" NUMBER(*,0) NOT NULL ENABLE,
"NAME" VARCHAR2(50 BYTE),
"PRICE" NUMBER(*,0),
"DESCRIPTION" VARCHAR2(180 BYTE),
"CREATE_DATE" DATE,
"UPDATE_DATE" DATE,
CONSTRAINT "PRODUCT_PK" PRIMARY KEY ("ID"))
I want to update data in this table. This is my stored procedure:
CREATE OR REPLACE PROCEDURE UPDATEPRODUCT(prod_id int, prod_name varchar2 default null, prod_price int default null) AS
BEGIN
update product
set
name = prod_name,
price = prod_price,
update_date = sysdate
where id = prod_id;
commit;
END UPDATEPRODUCT;
I'm using optional parameters; how can I update only one column, for example only "NAME" or "PRICE"?
Use COALESCE (or NVL) to keep the current value when a NULL value is passed in (or the default is used):
CREATE OR REPLACE PROCEDURE UPDATEPRODUCT(
prod_id PRODUCT.ID%TYPE,
prod_name PRODUCT.NAME%TYPE DEFAULT NULL,
prod_price PRODUCT.PRICE%TYPE DEFAULT NULL
)
AS
BEGIN
UPDATE product
SET name = COALESCE(prod_name, name),
price = COALESCE(prod_price, price),
update_date = SYSDATE
WHERE id = prod_id;
END UPDATEPRODUCT;
Also, do not COMMIT in a stored procedure as it prevents you from chaining multiple procedures together in a single transaction and rolling them all back as a block. Instead, COMMIT from the PL/SQL block that calls the procedure.
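For example, a call that passes only the price (hypothetical values) leaves the name unchanged:
BEGIN
    UPDATEPRODUCT(prod_id => 1, prod_price => 250);
END;
/
COMMIT;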
You can use the NVL function here. So your updated procedure would look like this:
CREATE OR REPLACE PROCEDURE UPDATEPRODUCT(prod_id int,
prod_name varchar2 default null,
prod_price int default null) AS
BEGIN
UPDATE product
SET name = NVL(prod_name, name),
price = NVL(prod_price, price),
update_date = sysdate
WHERE id = prod_id;
COMMIT;
EXCEPTION
WHEN OTHERS THEN
RAISE;
END UPDATEPRODUCT;

Dynamic tablename in teradata create statement

I am trying to create a dynamic table name using the following procedure in WhereScape RED:
SELECT CAST(CURRENT_DATE as format 'YYYYMMDD') into v_date;
SET v_tname = 'Anirban_Test' || v_date ||'030' ;
CREATE MULTISET TABLE [TABLEOWNER].[(SELECT * from v_tname)] NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
TARGET_JOB_NAME VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL)
UNIQUE PRIMARY INDEX UP_LOAD_PROTOCOL ( TARGET_JOB_NAME );
But the create statement is not working. Any help will be appreciated.
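DDL cannot take a variable as the table name directly; inside a Teradata stored procedure (which appears to be what WhereScape RED generates here) the usual workaround is to build the statement as a string and execute it dynamically with DBC.SysExecSQL. A hedged sketch, keeping [TABLEOWNER] as the WhereScape placeholder it is in the original and assuming v_date, v_tname and v_sql are DECLAREd as VARCHARs:
SET v_date = CAST(CAST(CURRENT_DATE AS FORMAT 'YYYYMMDD') AS VARCHAR(8));
SET v_tname = 'Anirban_Test' || v_date || '030';
SET v_sql = 'CREATE MULTISET TABLE [TABLEOWNER].' || v_tname || ' ,NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(TARGET_JOB_NAME VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL)
UNIQUE PRIMARY INDEX UP_LOAD_PROTOCOL (TARGET_JOB_NAME)';
CALL DBC.SysExecSQL(:v_sql);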

Identifier is too long while loading from SQL*Loader

I have a table structure like this
CREATE TABLE acn_scr_upload_header
(
FILE_RECORD_DESCRIPTOR varchar2(5) NOT NULL,
schedule_no Number(10) NOT NULL,
upld_time_stamp Date NOT NULL,
seq_no number NOT NULL,
filename varchar2(100) ,
schedule_date_time Date
);
When I try to load my file via SQL*Loader I'm getting an error on this value in the column filename: Stock_Count_Request_01122014010101.csv. The error is:
Error on table ACN_SCR_UPLOAD_HEADER, column FILENAME.
ORA-00972: identifier is too long
If I try to insert the same value into the table using an INSERT statement it works fine.
My data file Stock_Count_Request_01122014010101.csv looks like
FHEAD,1,12345,20141103
FDETL,7,100,W,20141231,SC100,B,N,1,5
FTAIL,8,6
and my control file:
LOAD DATA
INFILE '$IN_DIR/$FILENAME'
APPEND
INTO TABLE ACN_SCR_UPLOAD_HEADER
WHEN FILE_RECORD_DESCRIPTOR = 'FHEAD'
FIELDS TERMINATED BY ","
TRAILING NULLCOLS
(
FILE_RECORD_DESCRIPTOR position(1),
LINE_NO FILLER,
schedule_no ,
schedule_date_time,
upld_time_stamp sysdate,
seq_no "TJX_STOCK_COUNT_REQ_UPLD_SEQ.NEXTVAL",
FILENAME constant ""
)
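In SQL*Loader control files, single quotes delimit string literals while double quotes delimit identifiers and SQL expressions. So if the calling script substitutes the long file name (well over 30 characters) into that CONSTANT between double quotes, Oracle parses it as an identifier and fails with ORA-00972, since identifiers were capped at 30 bytes before 12.2. Assuming that is what happens here, the fix would be single quotes:
FILENAME constant 'Stock_Count_Request_01122014010101.csv'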
