I have two tables, adv_institution and institution. institution has 5000+ rows, while adv_institution has 1400+
I want to use Oracle MERGE to back-fill records into adv_institution from institution. The two tables have about four fields in common which I can use for the back-fill.
Here is my entire MERGE statement
merge into
    adv_institution to_t
using (
    select
        uni.*,
        adv_c.country_cd as con_code_text
    from
        (
            select
                institution_cd,
                name,
                institution_status,
                country_cd
            from
                institution uni
            where
                uni.institution_status = 'ACTIVE' and
                uni.country_cd is not null
            group by
                institution_cd,
                name,
                institution_status,
                country_cd
            order by
                name
        ) uni,
        country_cd c_cd,
        adv_country adv_c
    where
        uni.country_cd = c_cd.country_cd and
        c_cd.description = adv_c.country_cd
) from_t
on
(
    to_t.VENDOR_INSTITUTION_CD = from_t.INSTITUTION_CD or
    to_t.INSTITUTION_CD = from_t.NAME
)
WHEN NOT MATCHED THEN INSERT (
    to_t.INSTITUTION_CD,
    to_t.INSTITUTION_NAME,
    to_t.SHORT_NAME,
    to_t.COUNTRY_CD,
    to_t.NOTE,
    to_t.UNIT_TERMINOLOGY,
    to_t.COURSE_TERMINOLOGY,
    to_t.CLOSED_IND,
    to_t.UPDATE_WHO,
    to_t.UPDATE_ON,
    to_t.CALLISTA_INSTITUTION_CD
)
VALUES (
    from_t.NAME,
    from_t.NAME,
    '',
    from_t.con_code_text,
    '',
    'UNIT',
    'COURSE',
    'N',
    'MYUSER',
    SYSDATE,
    from_t.institution_cd
);
The error I got is
Error report -
ORA-00001: unique constraint (MYUSER.ADI_PK) violated
ADI_PK means adv_institution.institution_cd is a primary key and it must be unique.
That is because the WHEN NOT MATCHED THEN INSERT branch inserts from_t.NAME into to_t.INSTITUTION_CD.
It looks like from_t.NAME has the same value at least twice when it is inserted into to_t.INSTITUTION_CD.
But I did a GROUP BY to make sure from_t.NAME is unique:
(
    select
        institution_cd,
        name,
        institution_status,
        country_cd
    from
        institution uni
    where
        uni.institution_status = 'ACTIVE' and
        uni.country_cd is not null
    group by
        institution_cd,
        name,
        institution_status,
        country_cd
    order by
        name
) uni
I am not sure I understand the issue correctly. I have tried everything I can think of, but still no luck.
I think your main issue is with GROUP BY.
Please consider the example below:
desc temp_inventory;
Name                  Type
--------------------- -----------
WAREHOUSE_NO          NUMBER(2)
ITEM_NO               NUMBER(10)
ITEM_QUANTITY         NUMBER(10)

WAREHOUSE_NO  ITEM_NO  ITEM_QUANTITY
           1     1000            100
           1     2000            200
           1     2000            300
If I write a query where I want warehouse_no to be unique:
select warehouse_no,item_quantity
from temp_inventory
group by warehouse_no,item_quantity
It's going to return the same 3 rows. Instead, I want to group by:
select warehouse_no,sum(item_quantity)
from temp_inventory
group by warehouse_no
which will make warehouse_no unique in this situation!
Also, in cases where you have VARCHAR2 columns, you can use MAX or MIN on them as aggregate functions along with GROUP BY to make a unique key in the query.
Example:
Select object_type, min(object_name)
from user_objects group by object_type;
which will make object_type unique and return only one corresponding object name for it.
So note that if there are duplicates, in the end some records will be eliminated based on the aggregate function.
"But I did a group statement to make sure from_t.NAME is unique:"
But your query does not do that. It produces a set of distinct combinations of (institution_cd,name,institution_status,country_cd). Clearly such a set could contain multiple recurrences of name, one for each different value of country_cd. As you have four elements in your key you are virtually guaranteeing that your set will have multiple occurrences of name.
You compound this with the or in the ON conditions, which means you trigger the NOT MATCHED logic if to_t.VENDOR_INSTITUTION_CD = from_t.INSTITUTION_CD even though there is already a record in the target table where to_t.INSTITUTION_CD = from_t.NAME.
The problem is that the MERGE statement is atomic. The set of records coming from the USING subquery must contain unique keys. When Oracle finds a second occurrence of the same name in the result set it doesn't say, "I've already merged one of those, let's skip it." It has to hurl ORA-00001 because there is no way for Oracle to know which record to apply, i.e. which combination of (institution_cd, name, institution_status, country_cd) is the correct one.
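As a quick sanity check (a sketch against your own tables; adjust names as needed), you can count how many names recur in the source set before running the MERGE:

select name, count(*) as occurrences
from (
    select distinct institution_cd, name, institution_status, country_cd
    from institution
    where institution_status = 'ACTIVE'
      and country_cd is not null
)
group by name
having count(*) > 1;

Any name returned here will be inserted more than once and will therefore trip the primary key.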
To solve this you need to change the USING query to produce a result set with unique keys. It's your data model, you understand its business rules, so you're in the position to rewrite it properly. But maybe something like this:
(
    select
        name,
        max(institution_cd) as institution_cd,
        institution_status,
        max(country_cd) as country_cd
    from
        institution uni
    where
        uni.institution_status = 'ACTIVE' and
        uni.country_cd is not null
    group by
        name,
        institution_status
    order by
        name
) uni
Then you can simplify the MERGE ON clause to:
on
(
to_t.INSTITUTION_CD = from_t.NAME
)
The use of MAX() in the subquery is an inelegant kludge. I hope you can apply better business rules.
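For instance, if a rule such as "take the lowest institution_cd per name" is acceptable, a ROW_NUMBER() variant (only a sketch; pick the ORDER BY that matches your business rules) keeps each source row intact instead of mixing values from different rows the way independent MAX() calls can:

select institution_cd, name, institution_status, country_cd
from (
    select i.*,
           row_number() over (partition by name order by institution_cd) as rn
    from institution i
    where i.institution_status = 'ACTIVE'
      and i.country_cd is not null
)
where rn = 1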
Related
I want to check whether both of the values (4690, 4693) exist in the contextid column, without using functions, as the table contains more than a million records.
Table structure:

ID  CONTEXTID
 4       4690
 5       4690
 6       4693
 7       4693
 8       4690
What about this query?
select
case when count(distinct CONTEXTID) = 2 then 'Y' else 'N' end as contains_4690_4693
from tab
where CONTEXTID in (4690, 4693)
It returns Y if both keys are in the table at least once, N otherwise.
If you just want to find out if they exist then
SELECT DISTINCT CONTEXTID
FROM SOME_TABLE
WHERE CONTEXTID IN (4690, 4693)
will do it. If CONTEXTID isn't indexed, though, the database will have to do a full table scan, which will probably be slow.
Takeaway: add an index on CONTEXTID or live with the fact that it's going to be slow.
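For reference, adding such an index is a one-liner (hypothetical index name; adjust the table name to yours):

CREATE INDEX some_table_contextid_ix ON SOME_TABLE (CONTEXTID);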
If the "test" values are known when you write the query (as they very rarely are - even though all the solutions presented so far make the implicit assumption that they are), you could do something like this - which is probably the most efficient way, regardless of whether there is an index on the relevant column or not:
select case when exists
( select *
from sys.odcinumberlist(4690, 4693)
where column_value not in ( select contextid
from the_table
where contextid is not null
)
) then 'Not all found' else 'All found' end as result
from dual
;
Note how I gave an array of input values to the query - I used the sys.odcinumberlist constructor. You will have to clarify how you plan to "input" an array of "test" values.
Currently the code looks something like this:
LOOP AT lt_orders ASSIGNING <fs_order>.
  SELECT COUNT( * )
    FROM order_items
    INTO <fs_order>-cnt
    WHERE order_id = <fs_order>-order_id.
ENDLOOP.
It is the slowest part of the report. I want to speed it up.
How can I use FOR ALL ENTRIES with GROUP BY?
Check the documentation. You can't use GROUP BY. Maybe in this case, you could try selecting your items with FAE outside of the loop, then count them using a parallel cursor:
REPORT.

TYPES: BEGIN OF ty_result,
         vbeln TYPE vbeln,
         cnt   TYPE i,
       END OF ty_result.

DATA: lt_headers TYPE SORTED TABLE OF ty_result WITH UNIQUE KEY vbeln,
      lv_tabix   TYPE sy-tabix VALUE 1.

"get the headers
SELECT vbeln FROM vbak UP TO 100 ROWS INTO CORRESPONDING FIELDS OF TABLE lt_headers.

"get the corresponding items
SELECT vbeln, posnr FROM vbap FOR ALL ENTRIES IN @lt_headers
  WHERE vbeln EQ @lt_headers-vbeln
  ORDER BY vbeln, posnr
  INTO TABLE @DATA(lt_items).

LOOP AT lt_headers ASSIGNING FIELD-SYMBOL(<h>).
  LOOP AT lt_items FROM lv_tabix ASSIGNING FIELD-SYMBOL(<i>).
    IF <i>-vbeln NE <h>-vbeln.
      lv_tabix = sy-tabix.
      EXIT.
    ELSE.
      <h>-cnt = <h>-cnt + 1.
    ENDIF.
  ENDLOOP.
ENDLOOP.

BREAK-POINT.
Or join header/item with a distinct count on the item id (whichever column that would be in your table).
You should be able to do something like
SELECT order_id, COUNT( order_item_id ) AS cnt
  FROM order_items
  GROUP BY order_id
  INTO CORRESPONDING FIELDS OF TABLE @lt_count.
Assuming that order_item_id is a key in the order_items table, and assuming that lt_count has two fields: cnt of type int8 and order_id of the same type as your other order_id fields.
PS: then you can loop over lt_count and move the counts to lt_orders (or the other way around). To speed up the loop, sort one of the tables and use READ ... BINARY SEARCH, as sketched below.
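A rough sketch of that follow-up loop (assuming lt_orders has a cnt component, as in the question, and lt_count is filled as above):

SORT lt_count BY order_id.

LOOP AT lt_orders ASSIGNING FIELD-SYMBOL(<fs_order>).
  READ TABLE lt_count ASSIGNING FIELD-SYMBOL(<fs_count>)
       WITH KEY order_id = <fs_order>-order_id
       BINARY SEARCH.
  IF sy-subrc = 0.
    " copy the aggregated count into the order line
    <fs_order>-cnt = <fs_count>-cnt.
  ENDIF.
ENDLOOP.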
I did this with table KNB1 (customer master at company code level), where we have customers that are created in several company codes.
Please note: because of FOR ALL ENTRIES you have to SELECT the full key.
TYPES: BEGIN OF ty_knb1,
         kunnr TYPE knb1-kunnr,
         count TYPE i,
       END OF ty_knb1.

TYPES: BEGIN OF ty_knb1_fae,
         kunnr TYPE knb1-kunnr,
       END OF ty_knb1_fae.

DATA: lt_knb1_fae TYPE STANDARD TABLE OF ty_knb1_fae.
DATA: lt_knb1     TYPE HASHED TABLE OF ty_knb1
                  WITH UNIQUE KEY kunnr.
DATA: ls_knb1     TYPE ty_knb1.
DATA: ls_knb1_db  TYPE knb1.

START-OF-SELECTION.

  lt_knb1_fae = VALUE #( ( kunnr = ... ) ). "add at least one customer which is created in several company codes

  ls_knb1-count = 1.

  SELECT kunnr bukrs
    INTO CORRESPONDING FIELDS OF ls_knb1_db
    FROM knb1
    FOR ALL ENTRIES IN lt_knb1_fae
    WHERE kunnr EQ lt_knb1_fae-kunnr.

    ls_knb1-kunnr = ls_knb1_db-kunnr.
    COLLECT ls_knb1 INTO lt_knb1.

  ENDSELECT.
Create a range table for your lt_orders, like lt_orders_range.
Then select order_id, count( * ) where order_id in lt_orders_range, as sketched below.
If you think creating a range table is too much effort, remember that you still save a lot of runtime by running just one select for all orders instead of a single select for each order id.
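A sketch of that approach (assumes ABAP 7.40+ and that lt_orders has an order_id component; the names are illustrative):

DATA lt_orders_range TYPE RANGE OF order_items-order_id.

" build an 'I EQ' range line per order
lt_orders_range = VALUE #( FOR <o> IN lt_orders
                           ( sign = 'I' option = 'EQ' low = <o>-order_id ) ).

SELECT order_id, COUNT( * ) AS cnt
  FROM order_items
  WHERE order_id IN @lt_orders_range
  GROUP BY order_id
  INTO TABLE @DATA(lt_count).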
Not directly, only through a CDS view
While all of the answers provide a faster solution than the one in the question, the fastest way is not mentioned.
If you have at least Netweaver 7.4, EHP 5 (and you should, it was released in 2014), you can use CDS views, even if you are not on HANA.
It still cannot be done directly, as OpenSQL does not allow FOR ALL ENTRIES with GROUP BY, and CDS views cannot handle FOR ALL ENTRIES. However, you can create one of each.
CDS:
@AbapCatalog.sqlViewName: 'zorder_i_fae'
DEFINE VIEW zorder_items_fae AS SELECT FROM order_items {
  order_id,
  count( * ) AS cnt
}
GROUP BY order_id
OpenSQL:
SELECT *
  FROM zorder_items_fae
  FOR ALL ENTRIES IN @lt_orders
  WHERE order_id = @lt_orders-order_id
  INTO TABLE @DATA(lt_order_cnt).
Speed
If lt_orders contains more than about 30% of all possible order_id values from table ORDER_ITEMS, the answer from iPirat is faster. (While using more memory, obviously)
However, if you need only a couple hundred order_id values out of millions, this solution is about 10 times faster than any other answer, and 100 times faster than the original.
Is there any way to reduce the time taken to get the result from the query below?
Please help. Thanks in advance!
select status, count(distinct id)
from emp
where id >=
( select min(id)
from emp
where id >= (select max(id-200000) from emp)
and trunc(join_date) >= '01-Mar-2018')
group by status;
Use analytic functions - this will perform only a single table scan (whereas your query has three table/index scans):
SELECT status,
COUNT( DISTINCT id )
FROM (
SELECT status,
id,
MIN( CASE WHEN join_date >= DATE '2018-03-01' THEN id END ) OVER () AS min_id
FROM (
SELECT status,
id,
join_date,
MAX( id ) OVER () AS max_id
FROM emp
)
WHERE id >= max_id - 200000
)
WHERE id >= min_id
GROUP BY status;
Also, you can use a date literal (rather than relying on implicit conversion of a string to a date using the NLS_DATE_FORMAT session parameter), and you do not need to use the TRUNC() function (since that may prevent Oracle from using an index on the join_date column and would instead require a function-based index).
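For completeness, if you did want to keep TRUNC(join_date) in the predicate, the matching function-based index would look something like this (illustrative name):

CREATE INDEX emp_join_date_trunc_ix ON emp (TRUNC(join_date));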
It is important to know whether id is a primary key (as columns with that name usually are) or not. If it is not, you definitely need an index on id for the query to perform (and I would also wonder what the purpose of the column was). If id is the primary key, you don't need the distinct as the values will be unique anyway.
The select min(id) sub-select is redundant: as you already found max(id - 200000), you don't need to know the first min(id) greater than that. You can just use >= by itself (with the condition on the date added). By the way, I would write max(id) - 200000 instead; on some databases, it might work better.
The date comparison may be problematic. You should try an index on join_date if you haven't got one already, but the trunc might stop it from being used, so it would be best to remove that and make the other side of the comparison use a TO_TIMESTAMP or TO_DATE to generate a corresponding literal, setting the time to midnight.
But there can be problems with comparing timestamps due to timezones, etc. I'd need to know more about your setup to know whether that is likely to be a problem.
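Putting those suggestions together, the query could look roughly like this (a sketch, assuming id is the primary key and join_date is a DATE column):

select status, count(id)
from emp
where id >= (select max(id) - 200000 from emp)
  and join_date >= to_date('2018-03-01', 'YYYY-MM-DD')
group by status;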
I have a table with >1M rows of data and 20+ columns.
Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).
If possible I would like to retain the original table name and remove the duplicate records from my problematic column otherwise I could create a new table (tableXfinal) with the same schema but without the duplicates.
I am not proficient in SQL or any other programming language so please excuse my ignorance.
delete from Accidents.CleanedFilledCombined
where Fixed_Accident_Index in (
    select Fixed_Accident_Index
    from Accidents.CleanedFilledCombined
    group by Fixed_Accident_Index
    having count(Fixed_Accident_Index) > 1
);
You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).
A query that should work is here:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see:
https://stackoverflow.com/a/57900778/132438
An alternative to Jordan's answer - this one scales better when there are too many duplicates:
#standardSQL
SELECT event.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.created_at DESC LIMIT 1
)[OFFSET(0)] event
FROM `githubarchive.month.201706` t
# GROUP BY the id you are de-duplicating by
GROUP BY actor.id
)
Or a shorter version (takes any row, instead of the newest one):
SELECT k.*
FROM (
SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
FROM `fh-bigquery.reddit_comments.2017_01` x
GROUP BY id
)
To de-duplicate rows on an existing table:
CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
# SELECT id FROM UNNEST([1,1,1,2,2]) id
SELECT k.*
FROM (
SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k
FROM `deleting.deduplicating_table` row
GROUP BY id
)
Not sure why nobody mentioned a DISTINCT query.
Here is the way to clean duplicate rows:
CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table
If your schema doesn’t have any records - below variation of Jordan’s answer will work well enough with writing over same table or new one, etc.
SELECT <list of original fields>
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos
  FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1
In the more generic case - with a complex schema with records/nested fields, etc. - the above approach can be a challenge.
I would propose trying the Tabledata: insertAll API with rows[].insertId set to the respective Fixed_Accident_Index for each row.
In this case duplicate rows will be eliminated by BigQuery.
Of course, this will involve some client-side coding - so it might not be relevant for this particular question.
I haven't tried this approach myself either, but I feel it might be interesting to try :o)
If you have a large partitioned table, and only have duplicates in a certain partition range, you don't want to scan or process the whole table. Use the MERGE SQL below with predicates on the partition range:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partition table, using surrogate_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
SELECT k.*
FROM (
SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
FROM `gcp_project`.`data_set`.`the_table` AS original_data
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
)
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
Easier answer, without a subselect
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY row_number = 1
The WHERE TRUE is necessary because QUALIFY needs a WHERE, GROUP BY, or HAVING clause.
Felipe's answer is the best approach for most cases. Here is a more elegant way to accomplish the same:
CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined
AS
SELECT
Fixed_Accident_Index,
ARRAY_AGG(x LIMIT 1)[SAFE_OFFSET(0)].* EXCEPT(Fixed_Accident_Index)
FROM Accidents.CleanedFilledCombined AS x
GROUP BY Fixed_Accident_Index;
To be safe, make sure you back up the original table before you run this ^^
I don't recommend using the ROW_NUMBER() OVER() approach if possible, since you may run into BigQuery memory limits and get unexpected errors.
Update the BigQuery schema with a new table column bq_uuid, making it NULLABLE and of type STRING.
Create duplicate rows by running the same command five times, for example:
insert into beginner-290513.917834811114.messages (id, type, flow, updated_at) Values(19999,"hello", "inbound", '2021-06-08T12:09:03.693646')
Check if duplicate entries exist
select * from beginner-290513.917834811114.messages where id = 19999
Use the GENERATE_UUID() function to generate a UUID for each message:
UPDATE beginner-290513.917834811114.messages
SET bq_uuid = GENERATE_UUID()
where id>0
Clean duplicate entries
DELETE FROM beginner-290513.917834811114.messages
WHERE bq_uuid IN
(SELECT bq_uuid
FROM
(SELECT bq_uuid,
ROW_NUMBER() OVER( PARTITION BY updated_at
ORDER BY bq_uuid ) AS row_num
FROM beginner-290513.917834811114.messages ) t
WHERE t.row_num > 1 );
A weird request maybe, but my boss wants me to create an admin version of a page we have that displays data from an Oracle query in a table.
The admin page, instead of displaying the data (the query returns 1 row), needs to return the table name and column name.
Ex: Instead of:
Name Initial
==================
Bob A
I want:
Name Initial
============================
Users.FirstName Users.MiddleInitial
I realize I can do this in code but would rather just modify the query to return the data I want so I can leave the report generation code mostly alone.
I don't want to do it in a stored procedure.
So when I spit out the data in the report using something like:
blah blah = MyDataRow("FirstName")
I can leave that as is but instead of it displaying "BOB" it would display "Users.FirstName"
And I want to do the query using select * if possible instead of listing all the columns
So for each of the columns I am querying with the *, I want to get (instead of the column value) the tablename.ColumnName or tablename|columnName.
Hope you are following - I am confusing myself...
pseudo:
select tablename + '.' + Columnname as WhateverTheColumnNameIs
from Table1
left join Table2 on whatever...
Join Table_Names on blah blah
Whew- after writing all this I think I will just do it on the code side.
But if you are up for it maybe a fun challenge
Oracle does not provide an authentic way (there is no pseudocolumn) to get the column name of a table as a result of a query against that table. But you might consider these two approaches:
Extract the column name from an xmltype, formed by passing a cursor expression (your query) to the xmltable() function:
-- your table
with t1(first_name, middle_name) as(
select 1,2 from dual
), -- your query
t2 as(
select * -- col1 as "t1.col1"
--, col2 as "t1.col2"
--, col3 as "t1.col3"
from hr.t1
)
select *
from ( select q.object_value.getrootelement() as col_name
, rownum as rn
from xmltable('//*'
passing xmltype(cursor(select * from t2 where rownum = 1))
) q
where q.object_value.getrootelement() not in ('ROWSET', 'ROW')
)
pivot(
max(col_name) for rn in (1 as "name", 2 as "initial")
)
Result:
name initial
--------------- ---------------
FIRST_NAME MIDDLE_NAME
Note: in order for column names to be prefixed with the table name, you need to list them
explicitly in the select list of the query and supply an alias manually.
PL/SQL approach. Starting from Oracle 11g you could use the dbms_sql package, and specifically the describe_columns() procedure, to get the names of the columns in a cursor (your select).
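A minimal sketch of that approach (assuming a hypothetical USERS table and that DBMS_OUTPUT is enabled):

DECLARE
  c        INTEGER := DBMS_SQL.OPEN_CURSOR;
  col_cnt  INTEGER;
  desc_tab DBMS_SQL.DESC_TAB;
BEGIN
  DBMS_SQL.PARSE(c, 'select * from users', DBMS_SQL.NATIVE);
  DBMS_SQL.DESCRIBE_COLUMNS(c, col_cnt, desc_tab);
  FOR i IN 1 .. col_cnt LOOP
    -- col_name holds each column name as seen by the cursor
    DBMS_OUTPUT.PUT_LINE(desc_tab(i).col_name);
  END LOOP;
  DBMS_SQL.CLOSE_CURSOR(c);
END;
/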
This might be what you are looking for: try selecting from the system views USER_TAB_COLS or ALL_TAB_COLS, for example:
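(Hypothetical owner/table names; adjust to your schema.)

select table_name || '.' || column_name as qualified_name
from all_tab_cols
where owner = 'MYSCHEMA'
  and table_name = 'USERS'
order by column_id;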