How to understand entryDB in HAWQ - hawq

What is entryDB in HAWQ? What are the differences between the entryDB process on the master and the Query Executor (QE) process on a segment? And what kinds of queries run on entryDB?

EntryDB is a kind of query executor dispatched on the master node. The difference between EntryDB and a QE on a segment is that EntryDB is able to access the master catalog. UDFs are usually dispatched to EntryDB.

A query involving a UDF or a SERIAL column is sometimes planned to execute on the entry db. In practice, UDF/SERIAL work may be dispatched to the QD, to a QE on a segment, or to an EntryDB process, depending on the plan. A minimal, hypothetical sketch of such a UDF is shown below.
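For illustration only, here is a hypothetical catalog-reading UDF (the function name and body are mine, not from the original discussion). Because it needs the master catalog, the slice that evaluates it can end up on the entry db, though the exact placement depends on the plan:

-- Hypothetical UDF that reads the master catalog, which a segment QE cannot do.
CREATE FUNCTION count_heap_tables() RETURNS bigint AS $$
    SELECT count(*) FROM pg_class WHERE relkind = 'r';
$$ LANGUAGE sql;

-- In a dispatched query that calls such a function, EXPLAIN ANALYZE may show the
-- corresponding slice labelled "(entry db)", as in the plans below.
SELECT count_heap_tables();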
Here is an example with SERIAL. As you can see, the slice statistics report an "(entry db)" slice in both plans, with ORCA and with the legacy planner.
CREATE TABLE some_vectors (
id SERIAL,
x FLOAT8[]
);
NOTICE: CREATE TABLE will create implicit sequence "some_vectors_id_seq" for serial column "some_vectors.id"
CREATE TABLE
INSERT INTO some_vectors(x) VALUES
(ARRAY[1,0,0,0]),
(ARRAY[0,1,0,0]),
(ARRAY[0,0,1,0]),
(ARRAY[0,0,0,2]);
SET optimizer = on;
SET
EXPLAIN ANALYZE INSERT INTO some_vectors(x) VALUES (ARRAY[1,0,0,0]), (ARRAY[0,1,0,0]), (ARRAY[0,0,1,0]), (ARRAY[0,0,0,2]);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Insert (cost=0.00..0.31 rows=4 width=12)
  Rows out: Avg 4.0 rows x 1 workers. Max/Last(seg0:rhuo-mbp/seg0:rhuo-mbp) 4/4 rows with 7.320/7.320 ms to first row, 7.331/7.331 ms to end, start offset by 1.718/1.718 ms.
  Executor memory: 1K bytes.
  -> Redistribute Motion 1:1 (slice1) (cost=0.00..0.00 rows=4 width=20)
       Rows out: Avg 4.0 rows x 1 workers at destination. Max/Last(seg0:rhuo-mbp/seg0:rhuo-mbp) 4/4 rows with 1.044/1.044 ms to first row, 1.046/1.046 ms to end, start offset by 1.718/1.718 ms.
       -> Assert (cost=0.00..0.00 rows=4 width=20)
            Assert Cond: NOT id IS NULL
            Rows out: Avg 4.0 rows x 1 workers. Max/Last(seg-1:rhuo-mbp/seg-1:rhuo-mbp) 4/4 rows with 0.577/0.577 ms to first row, 0.824/0.824 ms to end, start offset by 1.826/1.826 ms.
            Executor memory: 1K bytes.
            -> Result (cost=0.00..0.00 rows=4 width=20)
                 Rows out: Avg 4.0 rows x 1 workers. Max/Last(seg-1:rhuo-mbp/seg-1:rhuo-mbp) 4/4 rows with 0.569/0.569 ms to first row, 0.815/0.815 ms to end, start offset by 1.826/1.826 ms.
                 -> Append (cost=0.00..0.00 rows=4 width=8)
                      Rows out: Avg 4.0 rows x 1 workers. Max/Last(seg-1:rhuo-mbp/seg-1:rhuo-mbp) 4/4 rows with 0.360/0.360 ms to first row, 0.402/0.402 ms to end, start offset by 1.827/1.827 ms.
                      -> Result (cost=0.00..0.00 rows=1 width=8)
                           Rows out: Avg 1.0 rows x 1 workers. Max/Last(seg-1:rhuo-mbp/seg-1:rhuo-mbp) 1/1 rows with 0.359/0.359 ms to first row, 0.360/0.360 ms to end, start offset by 1.827/1.827 ms.
                           -> Result (cost=0.00..0.00 rows=1 width=1)
                                Rows out: Avg 1.0 rows x 1 workers. Max/Last(seg-1:rhuo-mbp/seg-1:rhuo-mbp) 1/1 rows with 0/0 ms to end, start offset by 1.827/1.827 ms.
                      -> Result (cost=0.00..0.00 rows=1 width=8)
                           Rows out: Avg 1.0 rows x 1 workers. Max/Last(seg-1:rhuo-mbp/seg-1:rhuo-mbp) 1/1 rows with 0.015/0.015 ms to end, start offset by 2.411/2.411 ms.
                           -> Result (cost=0.00..0.00 rows=1 width=1)
                                Rows out: Avg 1.0 rows x 1 workers. Max/Last(seg-1:rhuo-mbp/seg-1:rhuo-mbp) 1/1 rows with 0/0 ms to end, start offset by 2.411/2.411 ms.
                      -> Result (cost=0.00..0.00 rows=1 width=8)
                           Rows out: Avg 1.0 rows x 1 workers. Max/Last(seg-1:rhuo-mbp/seg-1:rhuo-mbp) 1/1 rows with 0.012/0.012 ms to end, start offset by 2.500/2.500 ms.
                           -> Result (cost=0.00..0.00 rows=1 width=1)
                                Rows out: Avg 1.0 rows x 1 workers. Max/Last(seg-1:rhuo-mbp/seg-1:rhuo-mbp) 1/1 rows with 0/0 ms to end, start offset by 2.500/2.500 ms.
                      -> Result (cost=0.00..0.00 rows=1 width=8)
                           Rows out: Avg 1.0 rows x 1 workers. Max/Last(seg-1:rhuo-mbp/seg-1:rhuo-mbp) 1/1 rows with 0.013/0.013 ms to end, start offset by 2.581/2.581 ms.
                           -> Result (cost=0.00..0.00 rows=1 width=1)
                                Rows out: Avg 1.0 rows x 1 workers. Max/Last(seg-1:rhuo-mbp/seg-1:rhuo-mbp) 1/1 rows with 0/0 ms to end, start offset by 2.581/2.581 ms.
Slice statistics:
(slice0) Executor memory: 323K bytes (seg0:rhuo-mbp).
(slice1) Executor memory: 279K bytes (entry db).
Statement statistics:
Memory used: 262144K bytes
Settings: default_hash_table_bucket_number=6; optimizer=on
Optimizer status: PQO version 1.633
Dispatcher statistics:
executors used(total/cached/new connection): (2/2/0); dispatcher time(total/connection/dispatch data): (0.120 ms/0.000 ms/0.033 ms).
dispatch data time(max/min/avg): (0.026 ms/0.005 ms/0.015 ms); consume executor data time(max/min/avg): (0.023 ms/0.013 ms/0.018 ms); free executor time(max/min/avg): (0.000 ms/0.000 ms/0.000 ms).
Data locality statistics:
data locality ratio: 1.000; virtual segment number: 1; different host number: 1; virtual segment number per host(avg/min/max): (1/1/1); segment size(avg/min/max): (560.000 B/560 B/560 B); segment size with penalty(avg/min/max): (560.000 B/560 B/560 B); continuity(avg/min/max): (1.000/1.000/1.000); DFS metadatacache: 6.804 ms; resource allocation: 0.549 ms; datalocality calculation: 0.083 ms.
Total runtime: 31.656 ms
(42 rows)
SET optimizer = off;
SET
EXPLAIN ANALYZE INSERT INTO some_vectors(x) VALUES (ARRAY[1,0,0,0]), (ARRAY[0,1,0,0]), (ARRAY[0,0,1,0]), (ARRAY[0,0,0,2]);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Insert (slice0; segments: 1) (rows=4 width=32)
  -> Redistribute Motion 1:1 (slice1) (cost=0.00..0.07 rows=4 width=32)
       Rows out: Avg 4.0 rows x 1 workers at destination. Max/Last(seg0:rhuo-mbp/seg0:rhuo-mbp) 4/4 rows with 1.212/1.212 ms to first row, 1.215/1.215 ms to end, start offset by 1.643/1.643 ms.
       -> Values Scan on "*VALUES*" (cost=0.00..0.07 rows=4 width=32)
            Rows out: Avg 4.0 rows x 1 workers. Max/Last(seg-1:rhuo-mbp/seg-1:rhuo-mbp) 4/4 rows with 0.628/0.628 ms to first row, 0.888/0.888 ms to end, start offset by 1.848/1.848 ms.
Slice statistics:
(slice0) Executor memory: 255K bytes (seg0:rhuo-mbp).
(slice1) Executor memory: 201K bytes (entry db).
Statement statistics:
Memory used: 262144K bytes
Settings: default_hash_table_bucket_number=6; optimizer=off
Optimizer status: legacy query optimizer
Dispatcher statistics:
executors used(total/cached/new connection): (2/2/0); dispatcher time(total/connection/dispatch data): (0.118 ms/0.000 ms/0.025 ms).
dispatch data time(max/min/avg): (0.018 ms/0.006 ms/0.012 ms); consume executor data time(max/min/avg): (0.723 ms/0.022 ms/0.372 ms); free executor time(max/min/avg): (0.000 ms/0.000 ms/0.000 ms).
Data locality statistics:
data locality ratio: 1.000; virtual segment number: 1; different host number: 1; virtual segment number per host(avg/min/max): (1/1/1); segment size(avg/min/max): (280.000 B/280 B/280 B); segment size with penalty(avg/min/max): (280.000 B/280 B/280 B); continuity(avg/min/max): (1.000/1.000/1.000); DFS metadatacache: 0.053 ms; resource allocation: 0.560 ms; datalocality calculation: 0.073 ms.
Total runtime: 33.478 ms
(18 rows)

Related

Oracle script to create dynamic query based on SQL held in a field

A user gives a set of inputs in one table, e.g. "request_table" a:
User Input            | Value   | Field Name in Database
Product               | Deposit | product_type
Deposit Term (months) | 24      | term
Deposit Amount        | 200,000 | amount
Customer Type         | Charity | customer_type
Existing Customer     | Y       | existing_customer
I would like to use the product selection to pick out SQL snippets embedded in a "pricing_table" b, where the price is made up of components, each of which is affected by one or more of the above inputs:
Product | Grid | Measures | Value1 | Value1Min | Value1Max | Value2 | Value2Min | Value2Max | Price
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 0 | 12 | | 0 | 100000 | 1
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 12 | 36 | | 0 | 100000 | 2
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 36 | 9999 | | 0 | 100000 | 3
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 0 | 12 | | 100000 | 500000 | 1.1
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 12 | 36 | | 100000 | 500000 | 2.1
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 36 | 9999 | | 100000 | 500000 | 3.1
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 0 | 12 | | 500000 | 99999999 | 1.2
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 12 | 36 | | 500000 | 99999999 | 2.2
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 36 | 9999 | | 500000 | 99999999 | 3.2
Deposit | Customer_Type | a.customer_type=b.value1 | Personal | | | | | | 0
Deposit | Customer_Type | a.customer_type=b.value1 | Charity | | | | | | 0.1
Deposit | Customer_Type | a.customer_type=b.value1 | Business | | | | | | -0.1
Deposit | Existing_Customer | a.existing_customer=b.value1 | Y | | | | | | 0.1
Deposit | Existing_Customer | a.existing_customer=b.value1 | N | | | | | | 0
The lookup query is: select distinct measures from pricing_table where product = (select product_type from request_table). This gives multiple rows, each holding a piece of SQL logic.
I would like to run this SQL logic in a LOOP, e.g.:
select b.* from pricing_table b where :measures
This would return all rows where the specific metrics are matched.
I'm doing it this way because the number of input columns can grow to hundreds, so I don't want a really wide table.
Any help appreciated, thanks.
I've created the tables but am unsure how to loop over the measures and apply the values from that field in a looped query. Thanks.
In a PL/SQL pipelined function, you can build the SQL query, open a cursor on it, loop over the results, and PIPE the rows. A rough sketch of that approach follows.
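This is an untested, illustrative sketch only: the type names (t_price_row, t_price_tab) and the function name (get_prices) are invented, and the column names are assumed to match the tables described above.

-- Hypothetical schema-level types for the pipelined function's output.
CREATE OR REPLACE TYPE t_price_row AS OBJECT (grid VARCHAR2(30), price NUMBER);
/
CREATE OR REPLACE TYPE t_price_tab AS TABLE OF t_price_row;
/
CREATE OR REPLACE FUNCTION get_prices RETURN t_price_tab PIPELINED IS
  l_cur   SYS_REFCURSOR;
  l_grid  pricing_table.grid%TYPE;
  l_price pricing_table.price%TYPE;
BEGIN
  -- One iteration per distinct measure (pricing component) for the requested product.
  FOR m IN (SELECT DISTINCT measures
              FROM pricing_table
             WHERE product = (SELECT product_type FROM request_table)) LOOP
    -- Build the dynamic query: join the inputs to the pricing rows and append the
    -- predicate text held in the measures column (trusted, since it comes from
    -- your own pricing_table).
    OPEN l_cur FOR
      'SELECT b.grid, b.price
         FROM request_table a
         JOIN pricing_table b ON b.product = a.product_type
        WHERE b.measures = :m AND (' || m.measures || ')'
      USING m.measures;
    LOOP
      FETCH l_cur INTO l_grid, l_price;
      EXIT WHEN l_cur%NOTFOUND;
      PIPE ROW (t_price_row(l_grid, l_price));
    END LOOP;
    CLOSE l_cur;
  END LOOP;
  RETURN;
END;
/
-- The matched price components can then be read with:
-- SELECT * FROM TABLE(get_prices);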

Can I use 0 and 1 values in quantilesExact

From the quantile function documentation:
We recommend using a level value in the range of [0.01, 0.99]. Don't use a level value equal to 0 or 1 – use the min and max functions for these cases.
Does this also apply to the quantileExact and quantilesExact functions?
In my experiments, I've found that quantileExact(0) = min and quantileExact(1) = max, but I cannot be sure about it.
That recommendation is not about accuracy but about the computational cost of quantile*.
quantileExact is much, much heavier than max/min.
See the time difference: min/max is about 8 times faster, even on a simple dataset.
CREATE TABLE Speed ENGINE = MergeTree ORDER BY X
AS SELECT number AS X FROM numbers(1000000000);
SELECT min(X), max(X) FROM Speed;
┌─min(X)─┬────max(X)─┐
│      0 │ 999999999 │
└────────┴───────────┘
1 rows in set. Elapsed: 1.040 sec. Processed 1.00 billion rows, 8.00 GB (961.32 million rows/s., 7.69 GB/s.)
SELECT quantileExact(0)(X), quantileExact(1)(X) FROM Speed;
┌─quantileExact(0)(X)─┬─quantileExact(1)(X)─┐
│                   0 │           999999999 │
└─────────────────────┴─────────────────────┘
1 rows in set. Elapsed: 8.561 sec. Processed 1.00 billion rows, 8.00 GB (116.80 million rows/s., 934.43 MB/s.)
So it turns out it is safe to use the values 0 and 1 with quantileExact and quantilesExact; they are just much slower than min and max.

Hive TABLESAMPLE on clustered table

I want to ask about the correct way to use bucketing and TABLESAMPLE.
There is a table X which I created with:
CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC
Then I inserted 500M rows into X with:
set hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE X SELECT * FROM X_RAW;
Then I want to count or search some rows with a condition, roughly:
SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
But I had better use TABLESAMPLE, since I clustered X on (action_id, classifier).
So the better query should be:
SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'
Is there anything wrong above?
However, I can't find any performance gain between these two queries.
Query 1 and RESULT (with no tablesample):
SELECT COUNT(*) FROM X
WHERE action_id='aaa' and classifier='bbb'
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 256 256 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.35 s
--------------------------------------------------------------------------------
It scans the full data.
Query 2 and RESULT:
SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' and classifier='bbb'
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 256 256 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.82 s
--------------------------------------------------------------------------------
It ALSO scans the full data.
Query 2 RESULT, WHAT I EXPECTED:
The result I expected is something like this
(using 1 map and being relatively faster than without tablesample):
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 3.xx s
--------------------------------------------------------------------------------
The values of action_id and classifier are well distributed and there is no skewed data.
So what would be the correct query to target only ONE bucket and use only ONE map?

How to calculate Total average response time

Below are the results
sampler_label  count  average  median  90%_line    min    max
Transaction1       2    61774   61627     61921  61627  61921
Transaction2       4       82      61       190     15    190
Transaction3       4     1862    1317      3612   1141   3612
Transaction4       4     1242     915      1602    911   1602
Transaction5       4      692     608       906    423    906
Transaction6       4     2764    2122      4748   1182   4748
Transaction7       4     9369    9029     11337   7198  11337
Transaction8       4     1245     890      2168    834   2168
Transaction9       4     3475    2678      4586   2520   4586
TOTAL             34     6073    1381      9913     15  61921
My question here is: how is the total average response time (6073) calculated?
In my results, I want to exclude Transaction1's response time and then calculate the total average response time.
How can I do that?
Total Avg Response time = ((s1*t1) + (s2*t2)...)/s
s1 = No of times transaction 1 was executed
t1 = Avg response time for transaction 1
s2 = No of times transaction 2 was executed
t2 = Avg response time for transaction 2
s = Total no of samples (s1+s2..)
In your case, every transaction except Transaction1 was executed 4 times, so a simple average of (82, 1862, 1242, ...) gives the result you want. The arithmetic is sketched below.
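To make the arithmetic concrete (plain SQL used purely as a calculator; the numbers are copied from the table above):

-- Weighted total average over all 34 samples, matching the reported 6073:
SELECT (2*61774 + 4*(82 + 1862 + 1242 + 692 + 2764 + 9369 + 1245 + 3475)) / 34.0;  -- ~6072.7
-- Total average excluding Transaction1 (the 8 remaining samplers, 4 samples each):
SELECT (82 + 1862 + 1242 + 692 + 2764 + 9369 + 1245 + 3475) / 8.0;                 -- ~2591.4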

GAE Datastore Performance (Column vs ListProperty)

After watching "Google IO 2009: Building scalable, complex apps on App Engine" I performed some tests to help understand the impact of list de-serialization, but the results are quite surprising. Below are the test descriptions.
All tests are run on the GAE server.
Each test is performed 5 times, with its time and CPU usage recorded.
The tests compare the speed of fetching (float) data stored in columns vs. in a list.
Both the Column and List tables contain an extra datetime column for querying.
The same query is used to fetch data from both the Column and List tables.
TEST 1
- Fetch Single Row
- Table size: 500 Columns vs List of 500 (both contain 500 rows)
Table:ChartTestDbRdFt500C500R <-- 500 Columns x 500 Rows
OneRowCol Result <-- Fetching one row
[0] 0.02 (52) <-- Test 0, time taken = 0.02, CPU usage = 52
[1] 0.02 (60)
[2] 0.02 (56)
[3] 0.01 (46)
[4] 0.02 (57)
Table:ChartTestDbRdFt500L500R <-- List of 500 x 500 Rows
OneRowLst Result
[0] 0.01 (40)
[1] 0.02 (38)
[2] 0.01 (42)
[3] 0.05 (154)
[4] 0.01 (41)
TEST 2
- Fetch All Rows
- Table size: 500 Columns vs List of 500 (both contain 500 rows)
Table:ChartTestDbRdFt500C500R
AllRowCol Result
[0] 11.54 (32753)
[1] 10.99 (31140)
[2] 11.07 (31245)
[3] 11.55 (37177)
[4] 10.96 (34300)
Table:ChartTestDbRdFt500L500R
AllRowLst Result
[0] 7.46 (20872)
[1] 7.02 (19632)
[2] 6.8 (18967)
[3] 6.33 (17709)
[4] 6.81 (19006)
TEST 3
- Fetch Single Row
- Table size: 4500 Columns vs List of 4500 (both contain 10 rows)
Table:ChartTestDbRdFt4500C10R
OneRowCol Result
[0] 0.15 (419)
[1] 0.15 (433)
[2] 0.15 (415)
[3] 0.23 (619)
[4] 0.14 (415)
Table:ChartTestDbRdFt4500L10R
OneRowLst Result
[0] 0.08 (212)
[1] 0.16 (476)
[2] 0.07 (215)
[3] 0.09 (242)
[4] 0.08 (217)
CONCLUSION
Fetching a list of N items is actually quicker than fetching N columns. Does anyone know why this is the case? I thought there was a performance hit from list de-serialization? Or did I perform my tests incorrectly? Any insight will be helpful, thanks!
BigTable is a column-oriented database.
That means that fetching a 'row' of N columns is in fact N different read operations, all on the same index.
