Performance for validating data against various database tables


My Problem:
I loop over a table into a local structure named ls_eban,
and with that information I must apply these rules:
ls_eban-matnr MUST BE in table zmd_scmi_st01 ( 1. control Table (global) )
ls_eban-werks MUST BE in table zmd_scmi_st05 ( 2. control Table (global) )
ls_eban-knttp MUST BE in table zmd_scmi_st06 ( 3. control Table (global) )
I need a selection that is clear and performant. I do have one, but it isn't performant at all.
My solution:
SELECT st01~matnr st05~werks st06~knttp
FROM zmd_scmi_st01 AS st01
INNER JOIN zmd_scmi_st05 AS st05
ON st05~werks = ls_eban-werks
INNER JOIN zmd_scmi_st06 AS st06
ON knttp = ls_eban-knttp
INTO TABLE lt_control
WHERE st01~matnr = ls_eban-matnr AND st01~bedarf = 'X'
AND st05~bedarf = 'X'.
I should also say that the control tables don't have any relation to each other (no primary key and no secondary key).

The first thing you should not do is put the SELECT inside the loop. Instead of
loop at lt_eban into ls_eban.
Select ....
endloop.
You should do a single select.
if lt_eban[] is not initial.
select ...
into table ...
from ...
for all entries in lt_eban
where ...
endif.
There may be more inefficiencies to correct if we had more information (as vwegert mentioned in a comment: are there really no keys on the control tables?), but the select in a loop is the first thing that jumps out at me.
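To illustrate the idea of moving database access out of the loop, here is a minimal sketch in Python with sqlite3 rather than ABAP. The demo data and the helper names build_demo_db and valid_rows are hypothetical. It assumes the control tables are small enough to read once into memory, and that knttp needs no bedarf filter, as in the question's query.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with three tiny control tables (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE zmd_scmi_st01 (matnr TEXT, bedarf TEXT);
        CREATE TABLE zmd_scmi_st05 (werks TEXT, bedarf TEXT);
        CREATE TABLE zmd_scmi_st06 (knttp TEXT);
        INSERT INTO zmd_scmi_st01 VALUES ('M1', 'X'), ('M2', '');
        INSERT INTO zmd_scmi_st05 VALUES ('1000', 'X');
        INSERT INTO zmd_scmi_st06 VALUES ('K');
        """
    )
    return conn


def valid_rows(conn, eban_rows):
    """Keep only rows whose matnr, werks and knttp all appear in the control tables."""
    # One query per control table, each executed once, outside the loop.
    matnrs = {r[0] for r in conn.execute(
        "SELECT matnr FROM zmd_scmi_st01 WHERE bedarf = 'X'")}
    werks = {r[0] for r in conn.execute(
        "SELECT werks FROM zmd_scmi_st05 WHERE bedarf = 'X'")}
    knttps = {r[0] for r in conn.execute("SELECT knttp FROM zmd_scmi_st06")}
    # The loop itself now only performs in-memory set lookups.
    return [
        row for row in eban_rows
        if row["matnr"] in matnrs
        and row["werks"] in werks
        and row["knttp"] in knttps
    ]
```

Because the three control tables are unrelated, checking each value against its own set avoids the cross join that the original three-table SELECT builds for every row.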

Related

How to update a column based on count of another column from another table using any condition in visual foxpro?

SET STEP ON
Close Databases
Cd e:\ksv\Data
Use ohd IN 0 shared
Use cus IN 0 shared
SELECT * FROM cus inTO TABLE tempcus
ALTER table tempcus ADD COLUMN totalsold int
UPDATE tempcus SET totalsold=RECCOUNT(ohd.status='5') WHERE tempcus.customer=ohd.customer
SELECT * FROM tempcus INTO CURSOR cur
BROWSE
I have tried the above code and I am getting an error saying "invalid table number". Can someone help me with this?

The RECCOUNT() function only gives you the record count for a work area number or alias, e.g. RECCOUNT("ohd") gives the total record count of the ohd table.
You want something like:
SELECT COUNT(*) totalsold,cus.customer FROM cus JOIN ohd ON cus.customer=ohd.customer WHERE ohd.status='5' INTO CURSOR cur GROUP BY cus.customer
BROWSE

In VFP, there is a REPLACE command that lets you replace one or more fields with whatever values you like, whether fixed values or variables holding results from other queries. It works on the currently selected work area and the row it is positioned on, unless you apply a scope clause (a FOR condition).
A sample, only to give context for the REPLACE command:
use SomeOtherTable in 0 shared
select SomeOtherTable
replace SomeNumberField with 1.234, SomeStringField with 'Hello', etc...
or with a condition (bogus, just to show you can apply it to multiple rows):
replace SomeNumberField with SomeNumberField * 3 for StatusField = 'X'
Now, back to your original question. It appears you are trying to get a temporary result table with the total number of records from the OHD table where the status = '5'. VFP allows you to run SQL-Select into temporary read-write "cursor" tables that delete themselves when closed, yet can be modified (such as via browse, or other direct manipulation such as the REPLACE command).
You can get the counts you are looking for with a left-join to a query result set. To help you see the pieces individually, I will do it in steps so you can follow, then join them into one final query.
First, you want a count of all records in the OHD table with status = 5 per customer... the "o" and "c" are ALIAS references in the SQL queries below
SET STEP ON
Close Databases
Cd e:\ksv\Data
Use ohd IN 0 shared
Use cus IN 0 shared
select ;
o.customer, ;
count(*) NumberOfRecords ;
from ;
OHD o ;
where ;
o.status = '5' ;
group by ;
o.customer ;
into ;
cursor C_JustCountsPerCustomer READWRITE
The "into cursor" part above will create a workable table and give it the name of "C_JustCountsPerCustomer". I have always tried to use "C_" as a prefix to the table name for the sole purpose to know it is a temporary "CURSOR" result and not a real final table, but that is just my historical naming convention applied.
Now, if you did a browse of this result, you would see each customer's ID and how many with status = '5'. The resulting table "cursor" is like any other table opened and you could index as you need and browse, etc. But this only will give records that HAD status of '5'. But you could have more customers that never had a '5' status record.
Now, getting all your customers and their respective counts into one result table "cursor". I can take the above query and use within a SQL-Select via a LEFT-JOIN meaning, give me everything from the first table (left-side), regardless of a matching record found in the second table (right-side). But if there is a match to the right side, give me those values too.
select ;
c.*, ;
NVL( C_tmpResult.NumberOfRecords, 0000 ) as NumberOfRecords ;
from;
CUS c ;
LEFT JOIN ;
(select ;
o.customer, ;
count(*) NumberOfRecords ;
from ;
OHD o ;
where ;
o.status = '5' ;
group by ;
o.customer ) C_tmpResult ;
ON ;
c.customer = C_tmpResult.customer ;
into ;
cursor C_CusWithCounts readwrite
So, you can see the left-join uses the first query to get the counts, but the primary part of the query gets records from the customer table (alias "c") and is joined on the common customer id column. The "NVL()" says: if there IS a value in the C_tmpResult table for the given customer, grab that; if not, assume a count of 0. Yes, I explicitly wrote 0000 to force a minimum width of 4 digits in the result, in case the first customer does not have any and that makes the column only 1 digit wide.
Anyhow, at the end, you would have your result temporary table (cursor) with the customer information AND the count I think you are looking for. You should be able to do a browse and good to go.
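The same left-join-plus-default pattern can be tried outside VFP. This is a minimal sketch in Python with sqlite3, using COALESCE in place of NVL; the sample rows and the helper names build_cus_ohd_db and customers_with_counts are hypothetical.

```python
import sqlite3


def build_cus_ohd_db():
    """Create in-memory cus and ohd tables with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cus (customer TEXT);
        CREATE TABLE ohd (customer TEXT, status TEXT);
        INSERT INTO cus VALUES ('A'), ('B'), ('C');
        INSERT INTO ohd VALUES ('A', '5'), ('A', '5'), ('A', '1'), ('B', '1');
        """
    )
    return conn


def customers_with_counts(conn):
    """Return every customer with the number of its status '5' rows (0 if none)."""
    query = """
        SELECT c.customer,
               COALESCE(t.number_of_records, 0) AS number_of_records
        FROM cus AS c
        LEFT JOIN (
            SELECT customer, COUNT(*) AS number_of_records
            FROM ohd
            WHERE status = '5'
            GROUP BY customer
        ) AS t ON c.customer = t.customer
        ORDER BY c.customer
    """
    return conn.execute(query).fetchall()
```

Customers without any status '5' row still appear in the result, with a count of 0 supplied by COALESCE.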

Can I use FOR ALL ENTRIES with GROUP BY?

Currently the code looks something like this:
LOOP AT lt_orders ASSIGNING <fs_order>.
SELECT COUNT(*) AS cnt
FROM order_items
INTO <fs_order>-cnt
WHERE order_id = <fs_order>-order_id.
ENDLOOP.
It is the slowest part of the report. I want to speed it up.
How can I use FOR ALL ENTRIES with GROUP BY?

Check the documentation. You can't use GROUP BY. Maybe in this case, you could try selecting your items with FAE outside of the loop, then count them using a parallel cursor:
REPORT.
TYPES: BEGIN OF ty_result,
vbeln TYPE vbeln,
cnt TYPE i.
TYPES: END OF ty_result.
DATA: lt_headers TYPE SORTED TABLE OF ty_result WITH UNIQUE KEY vbeln,
lv_tabix TYPE sy-tabix VALUE 1.
"get the headers
SELECT vbeln FROM vbak UP TO 100 ROWS INTO CORRESPONDING FIELDS OF TABLE lt_headers.
"get corresponding items
SELECT vbeln, posnr FROM vbap FOR ALL ENTRIES IN @lt_headers
WHERE vbeln EQ @lt_headers-vbeln
ORDER BY vbeln, posnr
INTO TABLE @DATA(lt_items).
LOOP AT lt_headers ASSIGNING FIELD-SYMBOL(<h>).
LOOP AT lt_items FROM lv_tabix ASSIGNING FIELD-SYMBOL(<i>).
IF <i>-vbeln NE <h>-vbeln.
lv_tabix = sy-tabix.
EXIT.
ELSE.
<h>-cnt = <h>-cnt + 1.
ENDIF.
ENDLOOP.
ENDLOOP.
BREAK-POINT.
Or join header/item with a distinct count on the item id (whichever column that would be in your table).

You should be able to do something like
SELECT COUNT(order_item_id) AS cnt, order_id
FROM order_items
INTO CORRESPONDING FIELDS OF TABLE lt_count
GROUP BY order_id.
Assuming that order_item_id is a key in the order_items table. And assuming that lt_count has two fields: cnt of type int8 and order_id of same type as your other order_id fields
PS: then you can loop over lt_count and move the counts to lt_orders. Or the other way around. To speed up the loop, sort one of the tables and use READ ... BINARY SEARCH
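The "one grouped select, then move the counts over" step can be sketched as follows. This uses Python with sqlite3 rather than ABAP, a dictionary lookup in place of READ ... BINARY SEARCH, and hypothetical helper names make_order_items_db and attach_counts with made-up sample rows.

```python
import sqlite3


def make_order_items_db():
    """Create an in-memory order_items table with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE order_items (order_id INTEGER, order_item_id INTEGER);
        INSERT INTO order_items VALUES (1, 1), (1, 2), (2, 1);
        """
    )
    return conn


def attach_counts(conn, orders):
    """Fill order['cnt'] for every order using a single grouped query."""
    # One round trip: item count per order_id for the whole table.
    counts = dict(conn.execute(
        "SELECT order_id, COUNT(*) FROM order_items GROUP BY order_id"))
    # Moving the counts over is a constant-time lookup per order.
    for order in orders:
        order["cnt"] = counts.get(order["order_id"], 0)
    return orders
```

Orders that have no items get a count of 0, because they are missing from the grouped result.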

I did this with table KNB1 (customer master in company code), where we have customers that are created in several company codes.
Please note, because of FOR ALL ENTRIES you have to SELECT the full key.
TYPES: BEGIN OF ty_knb1,
kunnr TYPE knb1-kunnr,
count TYPE i,
END OF ty_knb1.
TYPES: BEGIN OF ty_knb1_fae,
kunnr TYPE knb1-kunnr,
END OF ty_knb1_fae.
DATA: lt_knb1_fae TYPE STANDARD TABLE OF ty_knb1_fae.
DATA: lt_knb1 TYPE HASHED TABLE OF ty_knb1
WITH UNIQUE KEY kunnr.
DATA: ls_knb1 TYPE ty_knb1.
DATA: ls_knb1_db TYPE knb1.
START-OF-SELECTION.
lt_knb1_fae = VALUE #( ( kunnr = ... ) ). "add at least one customer which is created in several company codes
ls_knb1-count = 1.
SELECT kunnr bukrs
INTO CORRESPONDING FIELDS OF ls_knb1_db
FROM knb1
FOR ALL ENTRIES IN lt_knb1_fae
WHERE kunnr EQ lt_knb1_fae-kunnr.
ls_knb1-kunnr = ls_knb1_db-kunnr.
COLLECT ls_knb1 INTO lt_knb1.
ENDSELECT.

Create a range table for your lt_orders, e.g. lt_orders_range.
Then select order_id, count( * ) where order_id in lt_orders_range, grouped by order_id.
Even if creating a range table seems like too much effort, you will gain a lot of performance by running one select for all orders instead of a single select per order id.

Not directly, only through a CDS view
While all of the answers provide a faster solution than the one in the question, the fastest way is not mentioned.
If you have at least Netweaver 7.4, EHP 5 (and you should, it was released in 2014), you can use CDS views, even if you are not on HANA.
It still cannot be done directly, as OpenSQL does not allow FOR ALL ENTRIES with GROUP BY, and CDS views cannot handle FOR ALL ENTRIES. However, you can create one of each.
CDS:
@AbapCatalog.sqlViewName: 'zorder_i_fae'
DEFINE VIEW zorder_items_fae AS SELECT FROM order_items {
order_id,
count( * ) AS cnt
}
GROUP BY order_id
OpenSQL:
SELECT *
FROM zorder_items_fae
INTO TABLE @DATA(lt_order_cnt)
FOR ALL ENTRIES IN @lt_orders
WHERE order_id = @lt_orders-order_id.
Speed
If lt_orders contains more than about 30% of all possible order_id values from table ORDER_ITEMS, the answer from iPirat is faster. (While using more memory, obviously)
However, if you need only a couple hundred order_id values out of millions, this solution is about 10 times faster than any other answer, and 100 times faster than the original.

Delete duplicate rows from a BigQuery table

I have a table with >1M rows of data and 20+ columns.
Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).
If possible I would like to retain the original table name and remove the duplicate records from my problematic column otherwise I could create a new table (tableXfinal) with the same schema but without the duplicates.
I am not proficient in SQL or any other programming language so please excuse my ignorance.
delete from Accidents.CleanedFilledCombined
where Fixed_Accident_Index
in(select Fixed_Accident_Index from Accidents.CleanedFilledCombined
group by Fixed_Accident_Index
having count(Fixed_Accident_Index) >1);

You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).
A query that should work is here:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
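What "number the rows within each key, then keep row number 1" does can be sketched in plain Python. The function name keep_one_per_key and the dictionary row format are hypothetical. The query has no ORDER BY inside OVER, so BigQuery may keep any row of a group; this sketch keeps the first one it sees.

```python
def keep_one_per_key(rows, key):
    """Return the rows with duplicates on `key` removed, keeping the first seen."""
    seen = set()
    kept = []
    for row in rows:
        value = row[key]
        # A value seen before corresponds to row_number > 1: skip it.
        if value not in seen:
            seen.add(value)
            kept.append(row)
    return kept
```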

UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see:
https://stackoverflow.com/a/57900778/132438
An alternative to Jordan's answer - this one scales better when having too many duplicates:
#standardSQL
SELECT event.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.created_at DESC LIMIT 1
)[OFFSET(0)] event
FROM `githubarchive.month.201706` t
# GROUP BY the id you are de-duplicating by
GROUP BY actor.id
)
Or a shorter version (takes any row, instead of the newest one):
SELECT k.*
FROM (
SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
FROM `fh-bigquery.reddit_comments.2017_01` x
GROUP BY id
)
To de-duplicate rows on an existing table:
CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
# SELECT id FROM UNNEST([1,1,1,2,2]) id
SELECT k.*
FROM (
SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k
FROM `deleting.deduplicating_table` row
GROUP BY id
)
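The first query in this answer keeps the newest row per key (ORDER BY created_at DESC LIMIT 1 inside ARRAY_AGG). A minimal plain-Python sketch of that selection rule, with a hypothetical function name newest_per_key and a hypothetical dictionary row format:

```python
def newest_per_key(rows, key, order_by):
    """Return one row per `key` value: the one with the largest `order_by` value."""
    best = {}
    for row in rows:
        value = row[key]
        # Keep this row if the key is new or the row is newer than the stored one.
        if value not in best or row[order_by] > best[value][order_by]:
            best[value] = row
    # Dictionaries preserve insertion order, so keys come out in first-seen order.
    return list(best.values())
```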

Not sure why nobody mentioned DISTINCT query.
Here is the way to clean duplicate rows:
CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table

If your schema doesn't have any records, the variation of Jordan's answer below will work well enough, whether writing over the same table or to a new one.
SELECT <list of original fields>
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos,
FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1
In the more generic case, with a complex schema with records/nested fields etc., the above approach can be a challenge.
I would propose trying the Tabledata: insertAll API with rows[].insertId set to the respective Fixed_Accident_Index for each row.
In this case duplicate rows will be eliminated by BigQuery.
Of course, this will involve some client-side coding, so it might not be relevant for this particular question.
I haven't tried this approach myself either, but feel it might be interesting to try :o)

If you have a large partitioned table and only have duplicates in a certain partition range, you don't want to overscan or process the whole table. Use the MERGE SQL below with predicates on the partition range:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partitioned table, using surrogate_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
SELECT k.*
FROM (
SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
FROM `gcp_project`.`data_set`.`the_table` AS original_data
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
)
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a

Easier answer, without a subselect
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY row_number = 1
The WHERE TRUE is necessary because QUALIFY needs a WHERE, GROUP BY or HAVING clause.

Felipe's answer is the best approach for most cases. Here is a more elegant way to accomplish the same:
CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined
AS
SELECT
Fixed_Accident_Index,
ARRAY_AGG(x LIMIT 1)[SAFE_OFFSET(0)].* EXCEPT(Fixed_Accident_Index)
FROM Accidents.CleanedFilledCombined AS x
GROUP BY Fixed_Accident_Index;
To be safe, make sure you back up the original table before you run this ^^
I don't recommend the ROW_NUMBER() OVER() approach if you can avoid it, since you may run into BigQuery memory limits and get unexpected errors.

Update the BigQuery schema with a new table column bq_uuid, making it NULLABLE and of type STRING.
Create duplicate rows, for example by running the same command 5 times:
insert into `beginner-290513.917834811114.messages` (id, type, flow, updated_at) Values(19999,"hello", "inbound", '2021-06-08T12:09:03.693646')
Check whether duplicate entries exist:
select * from `beginner-290513.917834811114.messages` where id = 19999
Use the GENERATE_UUID function to generate a uuid for each message:
UPDATE `beginner-290513.917834811114.messages`
SET bq_uuid = GENERATE_UUID()
where id>0
Clean the duplicate entries:
DELETE FROM `beginner-290513.917834811114.messages`
WHERE bq_uuid IN
(SELECT bq_uuid
FROM
(SELECT bq_uuid,
ROW_NUMBER() OVER( PARTITION BY updated_at
ORDER BY bq_uuid ) AS row_num
FROM `beginner-290513.917834811114.messages` ) t
WHERE t.row_num > 1 );

How to? Correct sql syntax for finding the next available identifier

I think I could use some help here from more experienced users...
I have an integer field, let's call it SO_ID, in a table SO, and for each new row I need to calculate a new SO_ID based on the following rules:
1) SO_ID consists of 6 digits, where the first 3 are an area code and the last three are the sequence number within this area.
309001
309002
309003
2) so the next new row will have a SO_ID of value
309004
3) if someone deletes the row with SO_ID value = 309002, then the next new row must recycle this value, so the next new row has got to have the SO_ID of value
309002
Can anyone please provide me with either a SQL or PL/SQL function (perhaps a trigger straight away?) that would return the next available SO_ID I need to use?
I reckon I could make use of the rownum keyword in my SQL, but the following just doesn't work properly:
select max(so_id),max(rownum) from(
select (so_id),rownum,cast(substr(cast(so_id as varchar(6)),4,3) as int) from SO
where length(so_id)=6
and substr(cast(so_id as varchar(6)),1,3)='309'
and cast(substr(cast(so_id as varchar(6)),4,3) as int)=rownum
order by so_id
);
thank you for all your help!

This kind of logic is fraught with peril. What if two sessions calculate the same "next" value, or both try to reuse the same "deleted" value? Since your column is an integer, you'd probably be better off querying "between 309001 and 309999", but that begs the question of what happens when you hit the thousandth item in area 309?
Is it possible to make SO_ID a foreign key to another table as well as a unique key? You could pre-populate the parent table with all valid IDs (or use a function to generate them as needed), and then it would be a simple matter to select the lowest one where a child record doesn't exist.
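The pre-populated parent table idea can be sketched like this, in Python with sqlite3 rather than Oracle. The table name valid_ids and the helper names make_so_db and lowest_free_so_id are hypothetical, and the sketch does nothing about two sessions picking the same value at the same time.

```python
import sqlite3


def make_so_db(area, used_sequences):
    """Create valid_ids (all 999 ids of one area) and so (the ids already in use)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE valid_ids (so_id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE so (so_id INTEGER PRIMARY KEY REFERENCES valid_ids (so_id))")
    conn.executemany(
        "INSERT INTO valid_ids VALUES (?)",
        [(area * 1000 + n,) for n in range(1, 1000)])
    conn.executemany(
        "INSERT INTO so VALUES (?)",
        [(area * 1000 + n,) for n in used_sequences])
    conn.commit()
    return conn


def lowest_free_so_id(conn, area):
    """Return the lowest valid id of the area that has no row in so, or None."""
    row = conn.execute(
        """
        SELECT MIN(v.so_id)
        FROM valid_ids AS v
        WHERE v.so_id BETWEEN ? AND ?
          AND NOT EXISTS (SELECT 1 FROM so AS s WHERE s.so_id = v.so_id)
        """,
        (area * 1000 + 1, area * 1000 + 999),
    ).fetchone()
    return row[0]
```

A deleted row frees its id automatically, because the id stays in valid_ids and simply has no child row any more.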

well, we came up with this... sort of works.. concurrency is 'solved' via unique constraint
select min(lastnumber)
from
(
select so_id,so_id-LAG(so_id, 1, so_id) OVER (ORDER BY so_id) AS diff,LAG(so_id, 1, so_id) OVER (ORDER BY so_id)as lastnumber
from so_miso
where substr(cast(so_id as varchar(6)),1,3)='309'
and length(so_id)=6
order by so_id
)a
where diff>1;

Do you really need to compute & store this value at the time a row is inserted? You would normally be better off storing the area code and a date in a table and computing the SO_ID in a view, i.e.
SELECT area_code ||
LPAD( DENSE_RANK() OVER( PARTITION BY area_code
ORDER BY date_column ),
3,
'0' ) AS so_id,
<<other columns>>
FROM your_table
or having a process that runs periodically (nightly, for example) to assign the SO_ID using similar logic.

If your application is not pure SQL, you could do this in application code (e.g. Java). This would be more straightforward.

If you are recycling numbers when rows are deleted, your base table must be consulted when generating the next number. "Legacy" pre-relational schemes that attempt to encode information in numbers are a pain to make airtight when numbers must be recycled after deletes, as you say yours must.
If you want to avoid having to scan your table looking for gaps, an after-delete routine must write the deleted number to a separate table in a "ReuseMe" column. The insert routine does this:
begins trans
selects next-number table for update
uses a reuseme number if available else uses the next number
clears the reuseme number if applicable or increments the next-number in the next-number table
commits trans
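The steps above can be sketched as follows, in Python with sqlite3. The table names next_number and reuse_me, the helper names, and the use of BEGIN IMMEDIATE as a stand-in for "select for update" (which SQLite does not have) are all assumptions for illustration. Running out of the 999 numbers of an area is not handled.

```python
import sqlite3


def make_allocator_db(area):
    """Create the next-number table and the reuse table for one area."""
    # isolation_level=None: transactions are controlled explicitly below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE next_number (area INTEGER PRIMARY KEY, next_seq INTEGER)")
    conn.execute("CREATE TABLE reuse_me (so_id INTEGER PRIMARY KEY)")
    conn.execute("INSERT INTO next_number VALUES (?, 1)", (area,))
    return conn


def release_so_id(conn, so_id):
    """After-delete routine: remember the freed number for later reuse."""
    conn.execute("INSERT INTO reuse_me (so_id) VALUES (?)", (so_id,))


def allocate_so_id(conn, area):
    """Insert routine: reuse a freed number if there is one, else take the next."""
    low, high = area * 1000, area * 1000 + 999
    conn.execute("BEGIN IMMEDIATE")  # begin the transaction and take the write lock
    try:
        reusable = conn.execute(
            "SELECT MIN(so_id) FROM reuse_me WHERE so_id BETWEEN ? AND ?",
            (low, high)).fetchone()[0]
        if reusable is not None:
            # Clear the reuse entry so it cannot be handed out twice.
            conn.execute("DELETE FROM reuse_me WHERE so_id = ?", (reusable,))
            so_id = reusable
        else:
            seq = conn.execute(
                "SELECT next_seq FROM next_number WHERE area = ?",
                (area,)).fetchone()[0]
            conn.execute(
                "UPDATE next_number SET next_seq = next_seq + 1 WHERE area = ?",
                (area,))
            so_id = low + seq
        conn.execute("COMMIT")
        return so_id
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

With this layout the base table never has to be scanned for gaps; the cost is that every delete must go through release_so_id.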

Ignoring the issues about concurrency, the following should give a decent start.
If 'traffic' on the table is low enough, go with locking the table in exclusive mode for the duration of the transaction.
create table blah (soc_id number(6));
insert into blah select 309000 + rownum from user_tables;
delete from blah where soc_id = 309003;
commit;
create or replace function get_next (i_soc in number) return number is
v_min number := i_soc * 1000;
v_max number := v_min + 999;
v_seq number;
begin
lock table blah in exclusive mode;
-- lowest sequence number in 1..999 that is not yet used in this area
select min(rn) into v_seq
from
(select rownum rn from dual connect by level <= 999
minus
select to_number(substr(soc_id,4))
from blah
where soc_id between v_min and v_max);
-- v_seq is null (and so is the result) when all 999 numbers are taken
return v_min + v_seq;
end;

How to create select SQL statement that would produce "merged" dataset from two tables(Oracle DBMS)?

Here's my original question:
merging two data sets
Unfortunately I omitted some intricacies that I'd like to elaborate on here.
So I have two tables, events_source_1 and events_source_2. I have to combine the data from those tables into a resulting dataset (which I'd then insert into a third table, but that's irrelevant).
events_source_1 contains historic event data and I have to get the most recent event, for which I'm doing the following:
select event_type,b,c,max(event_date),null next_event_date
from events_source_1
group by event_type,b,c,event_date,null
events_source_2 contains the future event data and I have to do the following:
select event_type,b,c,null event_date, next_event_date
from events_source_2
where b>sysdate;
How do I write an outer join statement to fill the void (i.e. when the same event_type, b, c is found in events_source_2, next_event_date is filled with the first date found)?
GREATLY APPRECIATE FOR YOUR HELP IN ADVANCE.

Hope I got your question right. This should return the latest event_date of events_source_1 per event_type, b, c and add the lowest event_date of events_source_2.
Select es1.event_type, es1.b, es1.c,
Max(es1.event_date),
Min(es2.event_date) As next_event_date
From events_source_1 es1
Left Join events_source_2 es2 On ( es2.event_type = es1.event_type
And es2.b = es1.b
And es2.c = es1.c
)
Group By es1.event_type, es1.b, es1.c
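To try the join-and-aggregate pattern on sample data, here is a minimal sketch in Python with sqlite3 rather than Oracle. It assumes the date column in events_source_2 is called next_event_date, as in the question's second query, and stores dates as ISO strings so that MAX and MIN order them correctly. The sample rows and helper names are hypothetical.

```python
import sqlite3


def make_events_db():
    """Create the two source tables with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events_source_1 (event_type TEXT, b TEXT, c TEXT, event_date TEXT);
        CREATE TABLE events_source_2 (event_type TEXT, b TEXT, c TEXT, next_event_date TEXT);
        INSERT INTO events_source_1 VALUES
            ('svc', 'b1', 'c1', '2020-01-01'),
            ('svc', 'b1', 'c1', '2021-06-01'),
            ('oil', 'b2', 'c2', '2019-03-03');
        INSERT INTO events_source_2 VALUES
            ('svc', 'b1', 'c1', '2030-05-05'),
            ('svc', 'b1', 'c1', '2029-01-01');
        """
    )
    return conn


def merged_events(conn):
    """Latest past event per (event_type, b, c) plus the earliest future event, if any."""
    query = """
        SELECT es1.event_type, es1.b, es1.c,
               MAX(es1.event_date) AS event_date,
               MIN(es2.next_event_date) AS next_event_date
        FROM events_source_1 AS es1
        LEFT JOIN events_source_2 AS es2
               ON es2.event_type = es1.event_type
              AND es2.b = es1.b
              AND es2.c = es1.c
        GROUP BY es1.event_type, es1.b, es1.c
        ORDER BY es1.event_type
    """
    return conn.execute(query).fetchall()
```

The join multiplies rows when both sides have several matches, but MAX and MIN are unaffected by repeated values, so the aggregates stay correct.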

You could just make the table where you need to select a max using a group by into a virtual table, and then do the full outer join as I provided in the answer to the prior question.
Add something like this to the top of the query:
with past_source as (
select event_type, b, c, max(event_date) as event_date
from events_source_1
group by event_type, b, c
)
Then you can use past_source as if it were an actual table, and continue your select right after the closing parens on the with clause shown.

I ended up with a two-step process: the 1st step populates the data from event table 1, the 2nd step MERGEs the data between the target (the dataset from the 1st step) and the other source. Please forgive me, but I had to obfuscate table names and omit some columns in the code below for legal reasons. Here's the SQL:
INSERT INTO EVENTS_TARGET (VEHICLE_ID,EVENT_TYPE_ID,CLIENT_ID,EVENT_DATE,CREATED_DATE)
select VEHICLE_ID, EVENT_TYPE_ID, DEALER_ID,
max(EVENT_INITIATED_DATE) EVENT_DATE, sysdate CREATED_DATE
FROM events_source_1
GROUP BY VEHICLE_ID, EVENT_TYPE_ID, DEALER_ID, sysdate;
Here's the second step:
MERGE INTO EVENTS_TARGET tgt
USING (
SELECT ee.VEHICLE_ID VEHICLE_ID, ee.POTENTIAL_EVENT_TYPE_ID POTENTIAL_EVENT_TYPE_ID, ee.CLIENT_ID CLIENT_ID,ee.POTENTIAL_EVENT_DATE POTENTIAL_EVENT_DATE FROM EVENTS_SOURCE_2 ee WHERE ee.POTENTIAL_EVENT_DATE>SYSDATE) src
ON (tgt.vehicle_id = src.VEHICLE_ID AND tgt.client_id=src.client_id AND tgt.EVENT_TYPE_ID=src.POTENTIAL_EVENT_TYPE_ID)
WHEN MATCHED THEN
UPDATE SET tgt.NEXT_EVENT_DATE=src.POTENTIAL_EVENT_DATE
WHEN NOT MATCHED THEN
insert (tgt.VEHICLE_ID,tgt.EVENT_TYPE_ID,tgt.CLIENT_ID,tgt.NEXT_EVENT_DATE,tgt.CREATED_DATE) VALUES (src.VEHICLE_ID, src.POTENTIAL_EVENT_TYPE_ID, src.CLIENT_ID, src.POTENTIAL_EVENT_DATE, SYSDATE)
;
