Teradata 3848 and small int with COMPRESS - sql-order-by

I am learning Teradata and have run into an issue when I UNION two queries. The error I run into is error 3848, "The ORDER BY clause must contain only integer constants".
I checked the table definition and all the columns I have been retrieving are Small Ints with identical definitions, except for one which uses COMPRESS with a long series of consecutive numbers starting from 3.
SELECT
COALESCE(ContractType, 'InvalidType') AS "Contract",
COALESCE(ContractStatus, 'InvalidStatus') AS "Status",
COUNT(ContractType) AS "Contract_Type_Count",
COUNT(ContractStatus) AS "Contract_Status_Count",
NULL AS "negCodeErr_count"
FROM fund_inventory_db.ContractDetail
GROUP BY CUBE (ContractType, ContractStatus)
UNION
SELECT
NULL,
NULL,
NULL,
NULL,
COUNT(*)
FROM fund_inventory_db.ContractDetail
WHERE ContractSource = -2
ORDER BY ContractType, ContractStatus;
The definitions for all those fields look like this:
[...columnName...] SMALLINT NOT NULL DEFAULT 0
Except for one column, which is:
[...columnName...] SMALLINT NOT NULL DEFAULT 0 COMPRESS (3,4,5,6,7,8...)
Does using COMPRESS like this make it possible that they are not able to order normally? As in, if one column uses COMPRESS(3,4,5,6,7...) and the other either uses COMPRESS (1,2,3,4,5...) or does not use COMPRESS at all, would that make a difference?
This might be embarrassing, but is it actually possible to use a UNION where one of the queries is using CUBE()?
Sorry, this is all new to me and my mentor is moving a little fast! I sincerely appreciate your time.

Related

Complex procedure to adjust data continously

I solved this in SQL Server with a trigger. Now I face it on Oracle.
I have a big set of data that periodically increases with new items.
Each item has these fundamental columns:
ID: string identifier (not null)
DATETIME (not null)
DATETIME_EMIS (optional; eventually null, always null for type 1): the emission datetime, equal to the DATETIME of the corresponding emission item
TYPE (0 or 1)
VALUE (only if type 1)
It is basically a logbook.
For example: an item with ID='FIREALARM' and datetime='2023-02-12 12:02' has a closing item like this:
ID='FIREALARM' with datetime='2023-02-12 15:11', emission datetime='2023-02-12 12:02' (equal to the emission item).
What I need is to obtain a final item in the destination table like this:
ID='FIREALARM' in DATETIME_BEGIN ='2023-02-12 12:02', DATETIME_END ='2023-02-12 15:11'
Not all items have a closing datetime (the ones of type=1 instead of 0); in this case the next item should be used to close the previous one (with the problem of finding it). For example:
Item1:
ID='DEVICESTATUS', datetime='2023-02-12 22:11', Value='Broken' ;
Item2:
ID='DEVICESTATUS', datetime='2023-02-12 22:14', Value='Running'
Should result in
ID='DEVICESTATUS', DATETIME_BEGIN ='2023-02-12 22:11',DATETIME_END ='2023-02-12 22:14', Value='Broken'
The final data should be extracted by a SELECT query as fast as possible.
The elaboration process should be independent of the order of insertion.
In SQL Server, I created a trigger with several operations involving a temporary table, some queries on the inserted set and on the entire destination table; it is a complex procedure that is not worth showing here to understand the problem.
Now I have discovered that Oracle has some limitations and it is not easy to port the trigger to it. For example, it is not easy to use a temporary table in the same way, and the trigger operations are per row.
I am asking what a good strategy in Oracle could be to elaborate the data into the final form, considering that the set increases continuously and the opening and closing items must be reduced to a single item. I am not asking for a solution to the problem; I am trying to understand which Oracle instruments would be useful to achieve a complex elaboration like this. Thanks.
From Oracle 12, you can use MATCH_RECOGNIZE to perform row-by-row pattern matching:
SELECT *
FROM destination
MATCH_RECOGNIZE(
PARTITION BY id
ORDER BY datetime
MEASURES
FIRST(datetime) AS datetime_begin,
LAST(datetime) AS datetime_end,
FIRST(value) AS value
PATTERN ( ^ any_row+ $ )
DEFINE
any_row AS 1 = 1
)
Which, for the sample data:
CREATE TABLE destination (id, datetime, value) AS
SELECT 'DEVICESTATUS', DATE '2023-02-12' + INTERVAL '22:11' HOUR TO MINUTE, 'Broken' FROM DUAL UNION ALL
SELECT 'DEVICESTATUS', DATE '2023-02-12' + INTERVAL '22:14' HOUR TO MINUTE, 'Running' FROM DUAL;
Outputs:
ID           | DATETIME_BEGIN      | DATETIME_END        | VALUE
DEVICESTATUS | 2023-02-12 22:11:00 | 2023-02-12 22:14:00 | Broken
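A possible refinement of the same approach, sketched here rather than taken from the answer above: if, for the type = 1 items, each row should be closed by the next row for the same ID, the pattern can pair an opening row with the row that follows it (the pattern-variable names opening/closing are illustrative):

SELECT *
FROM destination
MATCH_RECOGNIZE(
  PARTITION BY id
  ORDER BY datetime
  MEASURES
    opening.datetime AS datetime_begin,  -- datetime of the row being closed
    closing.datetime AS datetime_end,    -- datetime of the row that closes it
    opening.value    AS value
  AFTER MATCH SKIP TO LAST closing       -- the closing row starts the next interval
  PATTERN ( opening closing )
  DEFINE
    closing AS 1 = 1                     -- any row can act as the closing row
);

For the sample data above this should produce the single 'Broken' interval from 22:11 to 22:14; a final row with no successor is simply not matched, so open-ended intervals would need separate handling.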

Oracle error: NULL columns: expression must have same datatype as corresponding expression

I am trying to append two tables together. They don't have quite the same columns, but they contain data for the same clients: the table called "outcome" contains survey results collected from clients at 1 month, and the table called "checkpoint" contains survey results collected six months after. I tried to append the two tables and ensured that the number of columns matches by introducing NULL columns in both tables. Here is my query:
WITH tbl_out AS (
SELECT
PRG_NAME,
CASEREFERENCE,
STARTDATE,
STATUS,
ENDDATE,
LASTWRITTEN,
CLOSURE_REASON,
--these columns were made to match the tbl_check table--
TO_CHAR(NULL) AS Reviewer,
TO_DATE(NULL) AS Month_Schedule_Date,
TO_CHAR(NULL) AS Month_REASON,
TO_DATE(NULL) AS Month_Start_Date,
TO_DATE(NULL) AS Month_End_Date,
TO_CHAR(NULL) AS MONTH_Resubmit_MILESTONE,
TO_CHAR(NULL) AS MONTH_MILESTONE_ACHIEVED,
TO_DATE(NULL) AS MONTH_APPROVED_DATE,
--at 1 month--
OUTCOME_DATE,
Outcome_Reference_ID,
Outcome_EMP_SITUATION,
Outcome_Work_Job_Business,
Outcome_Employment_Type,
Outcome_NUM_JOBS,
Outcome_NAICS,
Outcome_NAICS_Desc,
Outcome_NOC,
Outcome_NOC_Desc,
Outcome_JOB_Nature,
TO_NUMBER(Outcome_WORK_HOURS),
TO_NUMBER(Outcome_WAGE),
Outcome_Change_Employment,
TO_NUMBER(Outcome_NUM_EMP_Change),
Outcome_LAST_UNEMP_DATE,
Outcome_Attend_School,
Outcome_STUDENT_STATUS,
Outcome_STUDENT_Type,
Outcome_EMP_CATEGORIES,
Outcome_Got_Service,
Outcome_Right_Service,
Outcome_Seek_Help_Again,
Outcome_Recommend_Program,
Outcome_Didnot_Seek_Employment
FROM outcome
),
tbl_check AS (
SELECT
PRG_NAME,
CASEREFERENCE,
STARTDATE,
STATUS,
ENDDATE,
LASTWRITTEN,
CLOSURE_REASON,
--info from tbl_out--
TO_DATE(NULL) AS OUTCOME_DATE,
--at 6 months--
TO_CHAR(Reviewer),
TO_DATE(Month_Schedule_Date),
TO_CHAR(Month_REASON),
TO_DATE(Month_Start_Date),
TO_DATE(Month_End_Date),
month_Review_Reference_ID,
month_outcome AS MONTH_EMP_SITUATION,
Month_Work_Job_Business,
Month_Outcome_Employment_Type,
month_NUM_JOBS,
month_NAICS,
Month_NAICS_Desc,
MONTH_NOC,
Month_NOC_Desc,
Month_JOB_Nature,
TO_NUMBER(MONTH_WORK_HOURS),
TO_NUMBER(MONTH_WAGE),
Month_Change_Employment,
TO_NUMBER(Month_NUM_EMP_Change),
MONTH_LAST_UNEMP_DATE,
Month_Attend_School,
Month_STUDENT_STATUS,
Month_STUDENT_Type,
Month_EMP_CATEGORIES,
Month_Got_Service,
Month_Right_Service,
Month_Seek_Help_Again,
Month_Recommend_Program,
Month_Didnot_Seek_Employment,
TO_CHAR(MONTH_Resubmit_MILESTONE),
TO_CHAR(MONTH_MILESTONE_ACHIEVED),
TO_DATE(MONTH_APPROVED_DATE)
FROM checkpoint
)
SELECT * FROM tbl_out
UNION
SELECT * FROM tbl_check
However, I still get the "expression must have same datatype as corresponding expression" error (ORA-01790).
I was wondering if anyone could please tell me how I can fix my query so that it runs properly? Thank you.
The columns in your two CTEs are in different orders. For example, in the first CTE Reviewer is the 8th column, but in the second CTE it's the 9th column. That's causing different datatypes to be in matching positions, not just what looks like non-matching data.
When you do:
SELECT * FROM ...
the projection has the columns in the order they are defined in the CTE; it doesn't automatically reorder them based on name, say; there's no requirement for the names to be the same (and they aren't the same for a lot of your columns).
Rearrange the columns in one or both CTEs so they align properly. Or list the columns in each select list instead of using *, though in this case that probably isn't helpful. In general avoid *, but it is sometimes a valid and sensible choice.
This has nothing to do with the nulls, other than that you've maybe put those in the wrong place.
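For illustration, a minimal sketch with made-up columns showing how a positional mismatch triggers the error, and how reordering one branch fixes it:

-- Fails with ORA-01790: position 2 is a DATE in one branch and a VARCHAR2 in the other.
WITH tbl_out AS (
  SELECT 'A' AS id, DATE '2023-01-01' AS outcome_date, CAST(NULL AS VARCHAR2(30)) AS reviewer FROM dual
),
tbl_check AS (
  SELECT 'B' AS id, CAST(NULL AS VARCHAR2(30)) AS reviewer, DATE '2023-06-01' AS outcome_date FROM dual
)
SELECT * FROM tbl_out
UNION
SELECT * FROM tbl_check;
-- Swapping the last two columns of tbl_check, so that every position holds the
-- same datatype in both branches, makes the UNION legal.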
Incidentally, and this is somewhat a personal preference, rather than doing things like:
TO_CHAR(NULL) AS Reviewer,
TO_DATE(NULL) AS Month_Schedule_Date,
I would usually cast to the right data type:
CAST(NULL AS VARCHAR2(30)) AS Reviewer,
CAST(NULL AS DATE) AS Month_Schedule_Date,
etc., matching the target data type, including string length and number scale/precision, for clarity. It's somewhat a matter of taste; there are four versions of to_char(), which all return varchar2, but it still feels a bit ambiguous.

Oracle SQL Query Performance, Function based Indexes

I have been trying to fine-tune a SQL query that takes 1.5 hours to process approximately 4,000 error records. The run time increases along with the number of rows.
I figured out that there is one condition in my SQL that is actually causing the issue:
AND (DECODE (aia.doc_sequence_value,
NULL, DECODE(aia.voucher_num,
NULL, SUBSTR(aia.invoice_num, 1, 10),
aia.voucher_num) ,
aia.doc_sequence_value) ||'_' ||
aila.line_number ||'_' ||
aida.distribution_line_number ||'_' ||
DECODE (aca.doc_sequence_value,
NULL, DECODE(aca.check_voucher_num,
NULL, SUBSTR(aca.check_number, 1, 10),
aca.check_voucher_num) ,
aca.doc_sequence_value)) = P_ID
(P_ID is a value from the first cursor's SQL.)
(Note that these are standard Oracle Applications (ERP) invoice tables.)
The P_ID column in the staging table is derived in the same way as the expression above, and it is compared here again in the second SQL to get the latest data for that record. (Basically we are reprocessing the error records; the value of P_ID is something like "999703_1_1_9995248".)
Q1) Can I create a function based index on the whole left side derivation? If so what is the syntax.
Q2) Would it be okay or against the oracle standard rules, to create a function based index on standard Oracle tables? (Not creating directly on the table itself)
Q3) If NOT what is the best approach to solve this issue?
Briefly: no, you can't place a function-based index on that expression, because the input values are derived from four different tables (or table aliases).
What you might look into is a materialised view, but that's a big and potentially difficult way to solve a single query optimisation problem.
You might investigate decomposing that string "999703_1_1_9995248" and applying the relevant parts to the separate expressions:
DECODE(aia.doc_sequence_value,
NULL,
DECODE(aia.voucher_num,
NULL, SUBSTR(aia.invoice_num, 1, 10),
aia.voucher_num) ,
aia.doc_sequence_value) = '999703' and
aila.line_number = '1' and
aida.distribution_line_number = '1' and
DECODE (aca.doc_sequence_value,
NULL,
DECODE(aca.check_voucher_num,
NULL, SUBSTR(aca.check_number, 1, 10),
aca.check_voucher_num) ,
aca.doc_sequence_value) = '9995248'
Then you can use indexes on the expressions and columns.
You could separate the four components of the P_ID value using regular expressions, or a combination of InStr() and SubStr().
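For example, a hedged sketch of pulling the four underscore-delimited parts out of a P_ID value with REGEXP_SUBSTR (the sample value comes from the question; the query shape is illustrative):

-- Split '999703_1_1_9995248' into its four components.
SELECT REGEXP_SUBSTR(p_id, '[^_]+', 1, 1) AS doc_or_voucher,
       REGEXP_SUBSTR(p_id, '[^_]+', 1, 2) AS line_number,
       REGEXP_SUBSTR(p_id, '[^_]+', 1, 3) AS distribution_line_number,
       REGEXP_SUBSTR(p_id, '[^_]+', 1, 4) AS check_doc_or_voucher
FROM (SELECT '999703_1_1_9995248' AS p_id FROM dual);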
Ad 1) Based on the SQL you've posted, you cannot create a function-based index on that. The reason is that a function-based index must be:
1. deterministic, i.e. the function used in the index definition has to always return the same result for given input arguments, and
2. built only on columns from the table the index is created for. In your case, based on the aliases you're using, you have four tables (aia, aila, aida, aca).
Requirement #2 makes it impossible to build a function-based index for that expression.
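That said, once the predicate is decomposed as shown above, the single-table part of the expression can carry its own function-based index. A hedged sketch, assuming aia is the standard E-Business Suite table ap_invoices_all (the index name is illustrative):

-- Function-based index on the aia-only part of the derivation.
CREATE INDEX ap_invoices_deriv_fbi ON ap_invoices_all (
  DECODE(doc_sequence_value,
         NULL, DECODE(voucher_num,
                      NULL, SUBSTR(invoice_num, 1, 10),
                      voucher_num),
         doc_sequence_value)
);

The query's predicate would then have to use exactly the same expression for the index to be considered.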

Synchronizing two tables using a stored procedure, only updating and adding rows where values do not match

The scenario
I've got two tables with identical structure.
TABLE [INFORMATION], [SYNC_INFORMATION]
[ITEM] [nvarchar](255) NOT NULL
[DESCRIPTION] [nvarchar](255) NULL
[EXTRA] [nvarchar](255) NULL
[UNIT] [nvarchar](2) NULL
[COST] [float] NULL
[STOCK] [nvarchar](1) NULL
[CURRENCY] [nvarchar](255) NULL
[LASTUPDATE] [nvarchar](50) NULL
[IN] [nvarchar](4) NULL
[CLIENT] [nvarchar](255) NULL
I'm trying to create a synchronize procedure that will be triggered by a scheduled event at a given time every day.
CREATE PROCEDURE [dbo].[usp_SynchronizeInformation]
AS
BEGIN
SET NOCOUNT ON;
--Update all rows
UPDATE TARGET_TABLE
SET TARGET_TABLE.[DESCRIPTION] = SOURCE_TABLE.[DESCRIPTION],
TARGET_TABLE.[EXTRA] = SOURCE_TABLE.[EXTRA],
TARGET_TABLE.[UNIT] = SOURCE_TABLE.[UNIT],
TARGET_TABLE.[COST] = SOURCE_TABLE.[COST],
TARGET_TABLE.[STOCK] = SOURCE_TABLE.[STOCK],
TARGET_TABLE.[CURRENCY] = SOURCE_TABLE.[CURRENCY],
TARGET_TABLE.[LASTUPDATE] = SOURCE_TABLE.[LASTUPDATE],
TARGET_TABLE.[IN] = SOURCE_TABLE.[IN],
TARGET_TABLE.[CLIENT] = SOURCE_TABLE.[CLIENT]
FROM SYNC_INFORMATION TARGET_TABLE
JOIN LSERVER.dbo.INFORMATION SOURCE_TABLE ON TARGET_TABLE.ITEMNO = SOURCE_TABLE.ITEMNO
WHERE TARGET_TABLE.ITEMNO = SOURCE_TABLE.ITEMNO
--Add new rows
INSERT INTO SYNC_INFORMATION (ITEMNO, DESCRIPTION, EXTRA, UNIT, STANDARDCOST, STOCKTYPE, CURRENCY_ID, LASTSTANDARDUPDATE, IN_ID, CLIENTCODE)
SELECT
src.ITEM,
src.DESCRIPTION,
src.EXTRA,
src.UNIT,
src.COST,
src.STOCKTYPE,
src.CURRENCY_ID,
src.LASTUPDATE,
src.IN,
src.CLIENT
FROM LSERVER.dbo.INFORMATION src
LEFT JOIN SYNC_INFORMATION targ ON src.ITEMNO = targ.ITEMNO
WHERE
targ.ITEMNO IS NULL
END
Currently, this procedure (including some others that are also executed at the same time) takes about 15 seconds to execute.
I'm planning on adding a "Synchronize" button in my work interface so that users can manually synchronize when, for instance, a new item is added and needs to be used the same day.
But in order for me to do that, I need to trim those 15 seconds as much as possible.
Instead of updating every single row, like in my procedure, is it possible to only update rows whose values do not match?
This would greatly increase the execution speed, since it wouldn't have to update all 4,000 rows when maybe only 20 actually need it.
Can this be done in a better way, or optimized?
Does it need improvements? If yes, where?
How would you solve this?
Would also appreciate some time differences between the solutions so I can compare them.
UPDATE
Using marc_s's CHECKSUM is really brilliant. The problem is that in some instances the information produces the same checksum. Here's an example; due to the classified content, I can only show you 2 columns, but I can say that all columns have identical information except these 2. To clarify: this screenshot is of all the rows that had duplicate CHECKSUMs, and these are also the only rows with a hyphen in the ITEM column (I've looked).
The query was simply
SELECT *, CHECKSUM(*) FROM SYNC_INFORMATION
If you can change the table structure ever so slightly, you could add a computed CHECKSUM column to your two tables; then, in the case where the ITEM is identical, you could check that checksum column to see if there are any differences at all in the columns of the table.
If you can do this - try something like this here:
ALTER TABLE dbo.[INFORMATION]
ADD CheckSumColumn AS CHECKSUM([DESCRIPTION], [EXTRA], [UNIT],
[COST], [STOCK], [CURRENCY],
[LASTUPDATE], [IN], [CLIENT]) PERSISTED
Of course, only include those columns that should be considered when deciding whether a source row and a target row are identical! (This depends on your needs and requirements.)
This persists a new column to your table, which is calculated as the checksum over the columns specified in the list of arguments to the CHECKSUM function.
This value is persisted, i.e. it could be indexed, too! :-O
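To make the comparison on the target side just as cheap, the same computed column could be mirrored on the sync table and indexed; a hedged sketch (the index name is illustrative):

-- Mirror the computed checksum on the sync table and index it.
ALTER TABLE dbo.SYNC_INFORMATION
ADD CheckSumColumn AS CHECKSUM([DESCRIPTION], [EXTRA], [UNIT],
                               [COST], [STOCK], [CURRENCY],
                               [LASTUPDATE], [IN], [CLIENT]) PERSISTED;

CREATE NONCLUSTERED INDEX IX_SYNC_INFORMATION_CheckSum
    ON dbo.SYNC_INFORMATION (CheckSumColumn);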
Now, you could simplify your UPDATE to
UPDATE TARGET_TABLE
SET ......
FROM SYNC_INFORMATION TARGET_TABLE
JOIN LSERVER.dbo.INFORMATION SOURCE_TABLE ON TARGET_TABLE.ITEMNO = SOURCE_TABLE.ITEMNO
WHERE
TARGET_TABLE.ITEMNO = SOURCE_TABLE.ITEMNO
AND TARGET_TABLE.CheckSumColumn <> SOURCE_TABLE.CheckSumColumn
Read more about the CHECKSUM T-SQL function on MSDN!

Performance tuning on reading huge table

I have a huge table with more than one hundred million rows, and I have to query this table to return a set of data in the minimum amount of time.
So I have created a test environment with this table definition:
CREATE TABLE [dbo].[Test](
[Dim1ID] [nvarchar](20) NOT NULL,
[Dim2ID] [nvarchar](20) NOT NULL,
[Dim3ID] [nvarchar](4) NOT NULL,
[Dim4ID] [smalldatetime] NOT NULL,
[Dim5ID] [nvarchar](20) NOT NULL,
[Dim6ID] [nvarchar](4) NOT NULL,
[Dim7ID] [nvarchar](4) NOT NULL,
[Dim8ID] [nvarchar](4) NOT NULL,
[Dim9ID] [nvarchar](4) NOT NULL,
[Dim10ID] [nvarchar](4) NOT NULL,
[Dim11ID] [nvarchar](20) NOT NULL,
[Value] [decimal](21, 6) NOT NULL,
CONSTRAINT [PK_Test] PRIMARY KEY CLUSTERED
(
[Dim1ID] ASC,
[Dim2ID] ASC,
[Dim3ID] ASC,
[Dim4ID] ASC,
[Dim5ID] ASC,
[Dim6ID] ASC,
[Dim7ID] ASC,
[Dim8ID] ASC,
[Dim9ID] ASC,
[Dim10ID] ASC,
[Dim11ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
This table is the fact table of Star schema architecture (fact/dimensions). As you can see I have a clustered index on all the columns except for the “Value” column.
I have filled this table with approx. 10,000,000 rows for testing purposes. The fragmentation is currently at 0.01%.
I would like to improve the performance when reading a set of rows from this table using this query:
DECLARE @Dim1ID nvarchar(20) = 'C1'
DECLARE @Dim9ID nvarchar(4) = 'VRT1'
DECLARE @Dim10ID nvarchar(4) = 'S1'
DECLARE @Dim6ID nvarchar(4) = 'FRA'
DECLARE @Dim7ID nvarchar(4) = '' -- empty = all
DECLARE @Dim8ID nvarchar(4) = '' -- empty = all
DECLARE @Dim2 TABLE ( Dim2ID nvarchar(20) NOT NULL )
INSERT INTO @Dim2 VALUES ('A1'), ('A2'), ('A3'), ('A4');
DECLARE @Dim3 TABLE ( Dim3ID nvarchar(4) NOT NULL )
INSERT INTO @Dim3 VALUES ('P1');
DECLARE @Dim4ID TABLE ( Dim4ID smalldatetime NOT NULL )
INSERT INTO @Dim4ID VALUES ('2009-01-01'), ('2009-01-02'), ('2009-01-03');
DECLARE @Dim11 TABLE ( Dim11ID nvarchar(20) NOT NULL )
INSERT INTO @Dim11 VALUES ('Var0001'), ('Var0040'), ('Var0060'), ('Var0099')
SELECT RD.Dim2ID,
RD.Dim3ID,
RD.Dim4ID,
RD.Dim5ID,
RD.Dim6ID,
RD.Dim7ID,
RD.Dim8ID,
RD.Dim9ID,
RD.Dim10ID,
RD.Dim11ID,
RD.Value
FROM dbo.Test RD
INNER JOIN @Dim2 R
ON RD.Dim2ID = R.Dim2ID
INNER JOIN @Dim3 C
ON RD.Dim3ID = C.Dim3ID
INNER JOIN @Dim4ID P
ON RD.Dim4ID = P.Dim4ID
INNER JOIN @Dim11 V
ON RD.Dim11ID = V.Dim11ID
WHERE RD.Dim1ID = @Dim1ID
AND RD.Dim9ID = @Dim9ID
AND ((@Dim6ID <> '' AND RD.Dim6ID = @Dim6ID) OR @Dim6ID = '')
AND ((@Dim7ID <> '' AND RD.Dim7ID = @Dim7ID) OR @Dim7ID = '')
AND ((@Dim8ID <> '' AND RD.Dim8ID = @Dim8ID) OR @Dim8ID = '')
I have tested this query and it returned 180 rows with these times:
1st execution: 1 min 32 sec; 2nd execution: 1 min.
I would like to return the data in a few seconds if possible.
I think I can add non-clustered indexes, but I am not sure what the best way is to set them up.
Would having the data stored in sorted order in this table improve performance?
Or are there other solutions besides indexes?
Thanks.
Consider your datatypes as one problem. Do you need nvarchar? It's measurably slower.
Second problem: the PK is wrong for your query. It should have Dim1ID, Dim9ID first (or vice versa, based on selectivity), or some flavour with the JOIN columns in.
Third problem: the use of OR. This construct usually works, despite what nay-sayers who don't try it will post:
RD.Dim7ID = ISNULL(@Dim7ID, RD.Dim7ID)
This assumes @Dim7ID is NULL, though. The optimiser will short-circuit it in most cases.
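A sketch of how the three optional filters might look under that assumption, i.e. with NULL rather than an empty string as the "all" sentinel (the literal filter values are taken from the question's example):

DECLARE @Dim1ID nvarchar(20) = 'C1';
DECLARE @Dim9ID nvarchar(4)  = 'VRT1';
DECLARE @Dim6ID nvarchar(4)  = 'FRA';
DECLARE @Dim7ID nvarchar(4)  = NULL;  -- NULL = all
DECLARE @Dim8ID nvarchar(4)  = NULL;  -- NULL = all

SELECT RD.Value
FROM dbo.Test RD
WHERE RD.Dim1ID = @Dim1ID
  AND RD.Dim9ID = @Dim9ID
  AND RD.Dim6ID = ISNULL(@Dim6ID, RD.Dim6ID)
  AND RD.Dim7ID = ISNULL(@Dim7ID, RD.Dim7ID)
  AND RD.Dim8ID = ISNULL(@Dim8ID, RD.Dim8ID);

Since the DimNID columns are declared NOT NULL, the ISNULL() comparison cannot accidentally drop rows.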
I'm with gbn on this. Typically in star schema data warehouses, the dimension IDs are int, which is 4 bytes. Not only are all your dimensions larger than that, the nvarchar are both varying and using wide characters.
As far as indexing goes, just one clustered index may be fine, since in the case of your fact table you really don't have many facts. As gbn says, with your particular example, your index needs to lead with the columns you are going to be providing so that the index can actually be used.
In a real-world case of a fact table with a number of facts, your clustered index is simply for data organization - you'll probably be expecting some non-clustered indexes for specific usages.
But I'm worried that your query specifies an ID parameter. Typically in a DW environment you don't know the IDs; for selective queries you select based on the dimensions, and the IDs are meaningless surrogates:
SELECT *
FROM fact
INNER JOIN dim1
ON fact.dim1id = dim1.id
WHERE dim1.attribute = ''
Have you looked at Kimball's books on dimensional modeling? I think if you are going to a star schema, you should probably be familiar with his design techniques, as well as the various pitfalls he discusses with the too many and too few dimensions.
See this: Dynamic Search Conditions in T-SQL, Version for SQL 2008 (SP1 CU5 and later).
The quick answer, if you are on the right service pack of SQL Server 2008, is to try adding this to the end of the query:
OPTION(RECOMPILE)
When on the proper service pack of SQL Server 2008, OPTION(RECOMPILE) will build the execution plan based on the runtime values of the local variables.
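For illustration, an abbreviated form of the query above with the hint appended (only one of the optional filters is shown):

DECLARE @Dim1ID nvarchar(20) = 'C1';
DECLARE @Dim9ID nvarchar(4)  = 'VRT1';
DECLARE @Dim6ID nvarchar(4)  = '';   -- empty = all

SELECT RD.Value
FROM dbo.Test RD
WHERE RD.Dim1ID = @Dim1ID
  AND RD.Dim9ID = @Dim9ID
  AND ((@Dim6ID <> '' AND RD.Dim6ID = @Dim6ID) OR @Dim6ID = '')
OPTION (RECOMPILE);  -- plan is compiled with the runtime variable values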
For people still using SQL Server 2008 without the proper service packs, or still on 2005, see: Dynamic Search Conditions in T-SQL, Version for SQL 2005 and Earlier.
I'd be a little concerned about having all the non-value columns in your clustered index. That makes for a large key in the non-leaf levels, and that key will be repeated in the nonclustered indexes. It will also only provide any benefit when [Dim1ID] is included in the query, so even if you're only optimizing this query, you're probably getting a full scan.
I would consider a clustered index on the most commonly used key, and if you have a lot of date-related queries (e.g., date between a and b), go with the date key. Then create nonclustered indexes on the other key values.
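A hypothetical sketch of such a nonclustered index for the sample query (the index name and exact column split are assumptions): key it on the equality and join columns the query provides, and include the remaining columns so the query can be answered from the index alone.

CREATE NONCLUSTERED INDEX IX_Test_Dim1_Dim9
    ON dbo.Test (Dim1ID, Dim9ID, Dim2ID, Dim3ID, Dim4ID, Dim11ID)
    INCLUDE (Dim5ID, Dim6ID, Dim7ID, Dim8ID, Dim10ID, Value);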
