I have a table with product names in a PostgreSQL database. Total rows: ~30M. I also have a history of prices in ClickHouse, and I want to join the names to the prices.
DDL to create the dictionary:
CREATE DICTIONARY products_dict
(
product_id String,
name String
)
PRIMARY KEY product_id
SOURCE(POSTGRESQL(
...
query 'SELECT product_id, name FROM products'
))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(3600);
Then I have the dictionary:
database: wdm
name: products_dict
uuid: 1464ba09-990c-4e69-9464-ba09990c0e69
status: LOADED
origin: 1464ba09-990c-4e69-9464-ba09990c0e69
type: ComplexKeyHashed
key.names: ['product_id']
key.types: ['String']
attribute.names: ['name']
attribute.types: ['String']
bytes_allocated: 4831830312
query_count: 57912282
hit_rate: 1
found_rate: 1
element_count: 28956140
load_factor: 0.4314801096916199
source: PostgreSQL: ...
lifetime_min: 0
lifetime_max: 3600
loading_start_time: 2022-01-17 03:53:21
last_successful_update_time: 2022-01-17 03:54:46
loading_duration: 84.79
last_exception:
comment:
Also I've got a table for this dictionary:
-- auto-generated definition
create table products_dict
(
product_id String,
name String
)
engine = Dictionary;
When I query this dictionary, it takes ~3 sec.
One ID with WHERE IN:
SELECT name FROM products_dict WHERE product_id IN ('97646221')
1 row retrieved starting from 1 in 2 s 891 ms (execution: 2 s 841 ms, fetching: 50 ms)
501 products without conditions or sorting:
SELECT t.*
FROM products_dict t
LIMIT 501
500 rows retrieved starting from 1 in 2 s 616 ms (execution: 2 s 601 ms, fetching: 15 ms)
JOIN:
SELECT ppd.*, p.name
FROM
(
SELECT
product_id,
price
FROM product_prices_daily
WHERE
product_id IN ('97646221','97646318','976464823','97647223','976472425','976474961','976476908')
AND day between '2022-01-13' and '2022-01-14'
) as ppd
LEFT JOIN products_dict as p ON p.product_id = ppd.product_id
4 rows retrieved starting from 1 in 6 s 984 ms (execution: 6 s 959 ms, fetching: 25 ms)
DBMS: ClickHouse (ver. 21.12.3.32)
Client: DataGrip 2021.3.2
Server: 128 GB RAM, dozens of cores, 3TB SSD without any load.
Reading from a 16-billion-row MergeTree table by product_id takes ~100 ms.
I've tested a manually created table with engine = Dictionary and got the same results.
I cannot use the FLAT layout because product_id is a String.
Another test with clickhouse-client:
ch01 :) SELECT name FROM products_dict WHERE product_id IN ('97646239');
SELECT name
FROM products_dict
WHERE product_id IN ('97646239')
Query id: d4f467c9-be0e-4619-841b-a76251d3e714
┌─name──┐
│ ...│
└───────┘
1 rows in set. Elapsed: 2.859 sec. Processed 28.96 million rows, 2.30 GB (10.13 million rows/s., 803.25 MB/s.)
What's wrong?
Such an optimization is not implemented yet.
Initially it was assumed that dictionaries would be used only via the dictGet functions; the table representation was introduced much later.
Internally, dictionaries are a set of hash tables: if your dictionary has 50 attributes, there will be 50 hash tables. These hash tables are very fast when you seek by key, but very slow when you need to find the next element.
Right now the query SELECT name FROM products_dict WHERE product_id IN ('97646239') is executed in a very straightforward way (a full scan of the hash tables), though it could be converted into dictGet under the hood.
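As a workaround, you can call dictGet explicitly instead of selecting from the table representation; for a COMPLEX_KEY_HASHED layout the key is passed as a tuple. A minimal sketch of the JOIN from the question rewritten this way (qualify the name as 'wdm.products_dict' if you are not connected to that database):

SELECT
    product_id,
    price,
    dictGet('products_dict', 'name', tuple(product_id)) AS name
FROM product_prices_daily
WHERE product_id IN ('97646221', '97646318', '97647223')
  AND day BETWEEN '2022-01-13' AND '2022-01-14';

Each row costs a single hash-table seek instead of a scan over all ~29M entries.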
Related
I am currently doing some testing and need a large amount of data (around 1 million rows).
I am using the following table:
CREATE TABLE OrderTable(
OrderID INTEGER NOT NULL,
StaffID INTEGER,
TotalOrderValue DECIMAL(8,2),
CustomerID INTEGER);
ALTER TABLE OrderTable ADD CONSTRAINT OrderID_PK PRIMARY KEY (OrderID);
CREATE SEQUENCE seq_OrderTable
MINVALUE 1
START WITH 1
INCREMENT BY 1
CACHE 10000;
and want to randomly insert 1000000 rows into it with the following rules:
OrderID needs to be sequential (1, 2, 3, etc.)
StaffID needs to be a random number between 1 and 1000
CustomerID needs to be a random number between 1 and 10000
TotalOrderValue needs to be a random decimal value between 0.00 and 9999.99
Is this even possible to do? I know I could generate each of these values with an update statement like the one below; however, I am not sure how to generate a million rows in one go.
Thanks for any help on this matter
This is how I would randomly generate a number in an update:
UPDATE StaffTable SET DepartmentID = DBMS_RANDOM.value(low => 1, high => 5);
For testing purposes I created the table and populated it in one shot, with this query:
CREATE TABLE OrderTable(OrderID, StaffID, CustomerID, TotalOrderValue)
as (select level, ceil(dbms_random.value(0, 1000)),
ceil(dbms_random.value(0,10000)),
round(dbms_random.value(0,10000),2)
from dual
connect by level <= 1000000)
/
A few notes: it is better to use NUMBER as the data type; NUMBER(8,2) is the Oracle equivalent of DECIMAL(8,2). It is much more efficient to populate this kind of table with the "hierarchical query without PRIOR" trick (the connect by level <= ... trick) to generate the order IDs.
If your table is created already, insert into OrderTable (select level ...) with the same subquery as in my code should work just as well; see the sketch below. You may be better off adding the PK constraint only after you create the data, though, so as not to slow things down.
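A minimal sketch of that INSERT variant, assuming the OrderTable from the question already exists (the explicit column list guards against a different column order in the DDL):

INSERT INTO OrderTable (OrderID, StaffID, CustomerID, TotalOrderValue)
SELECT level,
       ceil(dbms_random.value(0, 1000)),
       ceil(dbms_random.value(0, 10000)),
       round(dbms_random.value(0, 10000), 2)
FROM dual
CONNECT BY level <= 1000000;
COMMIT;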
A small sample from the table created (total time to create the 1,000,000-row table on my cheap laptop was 7.6 seconds):
SQL> select * from OrderTable where orderid between 500020 and 500030;
ORDERID STAFFID CUSTOMERID TOTALORDERVALUE
---------- ---------- ---------- ---------------
500020 666 879 6068.63
500021 189 6444 1323.82
500022 533 2609 1847.21
500023 409 895 207.88
500024 80 2125 1314.13
500025 247 3772 5081.62
500026 922 9523 1160.38
500027 818 5197 5009.02
500028 393 6870 5067.81
500029 358 4063 858.44
500030 316 8134 3479.47
I have two queries which I need to combine into a one-step process for business purposes.
SELECT
empid,
asset_x_id,
asset_y_id
FROM emp e
WHERE e.empid = 'SOME_UNIQUE_VAL';
result:
EMPID   ASSET_X_ID   ASSET_Y_ID
=====   ==========   ==========
1234    abc          pqr
-- Even though there are millions of rows in the table, this always returns 1 row within milliseconds, since it filters on the PK column with a unique value.
Now there is another table, asset_values, in a separate DB that holds the current market price (also a million rows):
SELECT
asset_id,
asset_type,
asset_current_price
FROM asset_values#x_db a
WHERE (asset_id, asset_type) IN (('abc', 'X'), ('pqr', 'Y'));
result:
asset_id   asset_type   asset_current_price
========   ==========   ===================
abc        X            10000
pqr        Y            5000
This also always returns 2-3 rows within a few milliseconds, since the primary key is defined on the combination of (asset_id, asset_type) and there are only 3 asset types: X, Y, and Z.
(Note: it is not possible to normalize this table further due to business rules.)
Now, to make it a one-step process, I tried to join these queries so that a script can take the empid from the user and fetch all the desired results.
The problem is that when I merge the two into the single query below, it runs for 15+ minutes before returning results:
SELECT
a.asset_id,
a.asset_type,
asset_current_price
FROM asset_values#x_db a, emp b
WHERE b.empid = 'SAME_UNIQUE_VAL'
AND (asset_id, asset_type) IN ((b.asset_x_id, 'X'), (b.asset_y_id, 'Y'));
Surprisingly, the explain plan also looks good (bytes: 597, cost: 2).
Can someone please give expert advice on this?
SELECT STATEMENT ALL_ROWSCost: 6 Bytes: 690 Cardinality: 2
13 CONCATENATION
6 NESTED LOOPS Cost: 3 Bytes: 345 Cardinality: 1
3 PARTITION RANGE SINGLE Cost: 1 Bytes: 2,232 Cardinality: 9 Partition #: 3 Partitions accessed #1
2 TABLE ACCESS BY LOCAL INDEX ROWID TABLE MYPRDAOWN.EMP Object Instance: 2 Cost: 1 Bytes: 2,232 Cardinality: 9 Partition #: 4 Partitions accessed #1
1 INDEX RANGE SCAN INDEX MYPRDAOWN.EMP_7IX Cost: 1 Cardinality: 9 Partition #: 5 Partitions accessed #1
5 FILTER Cost: 1 Bytes: 97 Cardinality: 1
4 REMOTE REMOTE SERIAL_FROM_REMOTE ASSEST_VALUES XDB
12 NESTED LOOPS Cost: 3 Bytes: 345 Cardinality: 1
9 PARTITION RANGE SINGLE Cost: 1 Bytes: 2,232 Cardinality: 9 Partition #: 9 Partitions accessed #1
8 TABLE ACCESS BY LOCAL INDEX ROWID TABLE MYPRDAOWN.EMP Object Instance: 2 Cost: 1 Bytes: 2,232 Cardinality: 9 Partition #: 10 Partitions accessed #1
7 INDEX RANGE SCAN INDEX MYPRDAOWN.EMP_7IX Cost: 1 Cardinality: 9 Partition #: 11 Partitions accessed #1
11 FILTER Cost: 1 Bytes: 97 Cardinality: 1
10 REMOTE REMOTE SERIAL_FROM_REMOTE ASSEST_VALUES XDB
From http://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#i49732:
Nested loop joins are useful when small subsets of data are being joined and if the join condition is an efficient way of accessing the second table.
Use hash joins instead, which:
... are used for joining large data sets. The optimizer uses the smaller of two tables or data sources to build a hash table on the join key in memory. It then scans the larger table, probing the hash table to find the joined rows.
To implement it, use a hint:
SELECT /*+ use_hash(a,b) */ a.asset_id,
a.asset_type,
asset_current_price
FROM asset_values#x_db a,
emp b
WHERE b.empid = 'SAME_UNIQUE_VAL'
AND (asset_id, asset_type) IN ((b.asset_x_id, 'X'), (b.asset_y_id, 'Y'));
I use AWS EMR to run my Hive queries and I have a performance issue running Hive version 0.13.1.
The newer version of Hive took around 5 minutes to run over 10 rows of data, but the same script over 230,804 rows has been running for 2 days and is still going. What should I do to analyze and fix the problem?
Sample Data:
Table 1:
hive> describe foo;
OK
orderno string
Time taken: 0.101 seconds, Fetched: 1 row(s)
Sample data for table1:
hive> select * from foo;
OK
1826203307
1826207803
1826179498
1826179657
Table 2:
hive> describe de_geo_ip_logs;
OK
id bigint
startorderno bigint
endorderno bigint
itemcode int
Time taken: 0.047 seconds, Fetched: 4 row(s)
Sample data for Table 2:
hive> select * from bar;
127698025 417880320 417880575 306
127698025 3038626048 3038626303 584
127698025 3038626304 3038626431 269
127698025 3038626560 3038626815 163
My Query:
SELECT b.itemcode
FROM foo a, bar b
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
At the very top of your Hive log output, it states: "Warning: Shuffle Join JOIN[4][Tables a, b] in Stage 'Stage-1 Mapred' is a cross product."
EDIT:
A 'cross product', or Cartesian product, is a join without conditions: it returns every row of the 'b' table for every row of the 'a' table. For example, if 'a' has 5 rows and 'b' has 10 rows, you get the product, 5 multiplied by 10 = 50 rows returned, many of which are meaningless combinations from one table or the other.
Now, if you have a table 'a' of 20,000 rows and join it to another table 'b' of 500,000 rows, you are asking the SQL engine to produce a data set 'a, b' of 10,000,000,000 rows, and then perform the BETWEEN filter on those 10 billion rows.
So, reducing the number of 'b' rows gives you more benefit than reducing 'a'. In your example, if you can filter the ip_logs table (table 2), which I am guessing has more rows than your order-number table, it will cut down the execution time.
END EDIT
You're forcing the execution engine to work through a Cartesian product by not specifying a join condition. It has to scan all of table 'a' over and over. With 10 rows, you will not have a problem. With 20k, you run into dozens of map/reduce waves.
Try this query:
SELECT b.itemcode
FROM foo a JOIN bar b on <SomeKey>
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
But I'm having trouble figuring out which column your data model would allow joining on. Maybe the data model for this expression could be improved? It may just be me not reading the sample clearly.
Either way, you need to filter down the number of comparisons BEFORE the WHERE clause does its work. Another way I have done this in Hive is to create a view over a smaller set of data and join/match against the view instead of the original table, as sketched below.
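A rough sketch of that view approach; the range literals are hypothetical and would need to cover foo's actual orderno values:

-- Keep only the slice of bar whose ranges can possibly match foo's order numbers
CREATE VIEW bar_small AS
SELECT startorderno, endorderno, itemcode
FROM bar
WHERE endorderno >= 1826000000      -- assumed lower bound of foo.orderno
  AND startorderno <= 1826999999;   -- assumed upper bound of foo.orderno

SELECT b.itemcode
FROM foo a, bar_small b
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;

The cross product still happens, but against a far smaller 'b' side.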
I have a table which holds more than 2 million records, and I am trying to update it using the following query:
UPDATE toc T
SET RANK = 65535 - (SELECT COUNT(*)
                    FROM toc T2
                    WHERE S_KEY LIKE '00010001%'
                      AND A_ID IS NOT NULL
                      AND T2.TARGET = T.TARGET
                      AND T2.RANK > T.RANK)
WHERE S_KEY LIKE '00010001%' AND A_ID IS NOT NULL
Usually this query takes 5 minutes to update 50,000 rows in our staging DB, which is an exact replica of the production DB, but in production it takes 6 hours to execute...
I tried the Oracle tuning advisor to pick a better execution plan, but nothing is working...
Plan
UPDATE STATEMENT ALL_ROWSCost: 329,471
6 UPDATE TT.TOC
2 TABLE ACCESS BY INDEX ROWID TABLE TT.TOC Cost: 5 Bytes: 4,173,236 Cardinality: 54,911
1 INDEX SKIP SCAN INDEX TT.DATASTAT_SORTKEY_IDX Cost: 4 Cardinality: 1
5 SORT AGGREGATE Bytes: 76 Cardinality: 1
4 TABLE ACCESS BY INDEX ROWID TABLE TT.TOC Cost: 5 Bytes: 76 Cardinality: 1
3 INDEX SKIP SCAN INDEX TT.DATASTAT_SORTKEY_IDX Cost: 4 Cardinality: 1
I can see the following wait events
1,066 db file sequential read 10,267 0 3,993 0 6 39,933,580
1,066 db file scattered read 413 0 188 0 6 1,876,464
Any help will be greatly appreciated.
Here is the current list of indexes:
INDEX_NAME       COLUMN_NAME       COLUMN_POSITION
DSTAT_SKEY_IDX   D_STATUS          1
DSTAT_SKEY_IDX   S_KEY             2
IDX$$_165A0002   N_LABEL           1
S_KEY_IDX        S_KEY             1
XAK1_TOC         N_RELATIONSHIP    1
XAK2_TOC         TARGET            1
XAK2_TOC         N_LABEL           2
XAK2_TOC         D_STATUS          3
XAK2_TOC         A_ID              4
XIE1_TOC         N_RELBASE         1
XIF4_TOC         SOURCE_FILE_ID    1
XIF5_TOC         A_ID              1
XPK_TOC          N_ID              1
You're doing a skip scan where you presumably want a range scan.
A range scan is only possible when the index columns are ordered by descending selectivity; in your case it seems that should be S_KEY - TARGET - RANK.
Update: rewriting the query in a different order wouldn't make any difference. What matters is the order of the columns in the table's indexes.
First, show us the current index columns for that table:
select index_name, column_name, column_position from all_ind_columns where table_name = 'TOC'
Then you could create a new index, e.g.:
create index toc_i_s_key_target_rank on toc (s_key, target, rank) compress;
Suppose I have the following tables.
Target table sales:
ID   ItemNum   DiscAmt   OrigAmt
1    123       20.00     NULL
2    456       30.00     NULL
3    123       20.00     NULL
Source table prices:
ItemNum   OrigAmt
123       25.00
456       35.00
I tried to update OrigAmt in the target table using OrigAmt from the source table with:
UPDATE
( SELECT s.OrigAmt dests
,p.OrigAmt srcs
FROM sales s
LEFT JOIN prices p
ON s.ItemNum = p.ItemNum
) amnts
SET amnts.dests = amnts.srcs
;
but I get: ORA-01779: cannot modify a column which maps to a non key-preserved table.
I also tried using MERGE, but I get: ORA-30926: unable to get a stable set of rows in the source tables.
You cannot generally UPDATE the result of an arbitrary SELECT.
Single statement, assuming ItemNum is a primary key for prices:
UPDATE sales
SET OrigAmt =
    (SELECT MAX(OrigAmt) FROM prices
     WHERE prices.ItemNum = sales.ItemNum)
WHERE (SELECT COUNT(prices.ItemNum) FROM prices
       WHERE prices.ItemNum = sales.ItemNum) > 0
You might get away with omitting the WHERE and/or MAX.
Less convoluted: loop a cursor over
SELECT ItemNum, OrigAmt FROM prices
performing one update per ItemNum from table prices:
UPDATE sales SET OrigAmt = ? WHERE ItemNum = ?
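A minimal PL/SQL sketch of that cursor loop, assuming the sales and prices tables from the question:

-- Walk prices once and push each OrigAmt into all matching sales rows
BEGIN
  FOR r IN (SELECT ItemNum, OrigAmt FROM prices) LOOP
    UPDATE sales
    SET OrigAmt = r.OrigAmt
    WHERE ItemNum = r.ItemNum;
  END LOOP;
  COMMIT;
END;
/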