See number of live/dead tuples in MonetDB - monetdb

I'm trying to get some precise row counts for all tables, given that some have deleted rows. I have been using sys.storage.count. But this seems to count the deleted ones also.
I assume using sys.storage would be simpler and faster than looping through count(*) queries, though both strategies may be fine in practice.
Maybe there is some column that counts modifications so I could just subtract the two counts?

If all you need to know is the number of actual rows in a table, I'd recommend just using a count(*) query. It's very fast. Even if you have N tables, it's easy to do a count(*) for each table.
sys.storage gives you information from the raw storage. With that, you can get pretty low-level information, but it has some edges. sys.storage.count returns the count in the storage, hence, indeed, it includes the delete rows since they are not actually deleted. As of Jul2021 version of MonetDB, deleted rows are automatically overwritten by new inserts (i.e. auto-vacuuming). So, to get the actual row count, you need to look up the 'deletes' from sys.deltas('<schema>', '<table>'). For instance:
sql>create table tbl (id int, city string);
operation successful
sql>insert into tbl values (1, 'London'), (2, 'Paris'), (3, 'Barcelona');
3 affected rows
sql>select * from tbl;
+------+-----------+
| id | city |
+======+===========+
| 1 | London |
| 2 | Paris |
| 3 | Barcelona |
+------+-----------+
3 tuples
sql>select schema, table, column, count from sys.storage where table='tbl';
+--------+-------+--------+-------+
| schema | table | column | count |
+========+=======+========+=======+
| sys | tbl | city | 3 |
| sys | tbl | id | 3 |
+--------+-------+--------+-------+
2 tuples
sql>select id, deletes from sys.deltas ('sys', 'tbl');
+-------+---------+
| id | deletes |
+=======+=========+
| 15569 | 0 |
| 15570 | 0 |
+-------+---------+
2 tuples
After we delete one row, the actual row count is sys.storage.count - sys.deltas ('sys', 'tbl').deletes:
sql>delete from tbl where id = 2;
1 affected row
sql>select * from tbl;
+------+-----------+
| id | city |
+======+===========+
| 1 | London |
| 3 | Barcelona |
+------+-----------+
2 tuples
sql>select schema, table, column, count from sys.storage where table='tbl';
+--------+-------+--------+-------+
| schema | table | column | count |
+========+=======+========+=======+
| sys | tbl | city | 3 |
| sys | tbl | id | 3 |
+--------+-------+--------+-------+
2 tuples
sql>select id, deletes from sys.deltas ('sys', 'tbl');
+-------+---------+
| id | deletes |
+=======+=========+
| 15569 | 1 |
| 15570 | 1 |
+-------+---------+
2 tuples
After we insert a new row, the deleted row is overwritten:
sql>insert into tbl values (4, 'Praag');
1 affected row
sql>select * from tbl;
+------+-----------+
| id | city |
+======+===========+
| 1 | London |
| 4 | Praag |
| 3 | Barcelona |
+------+-----------+
3 tuples
sql>select schema, table, column, count from sys.storage where table='tbl';
+--------+-------+--------+-------+
| schema | table | column | count |
+========+=======+========+=======+
| sys | tbl | city | 3 |
| sys | tbl | id | 3 |
+--------+-------+--------+-------+
2 tuples
sql>select id, deletes from sys.deltas ('sys', 'tbl');
+-------+---------+
| id | deletes |
+=======+=========+
| 15569 | 0 |
| 15570 | 0 |
+-------+---------+
2 tuples
So, the formula to compute the actual row count (sys.storage.count - sys.deltas ('sys', 'tbl').deletes) is generally applicable. sys.deltas() keeps stats for every column of a table, but the count and deletes are table wide, so you only need to check one column.

Related

Oracle UNION ALL query takes temp space

I have a query like this:
SELECT * FROM TEST1 LEFT OUTER JOIN TEST2 on TEST1.ID=TEST2.ID
UNION ALL
SELECT * FROM TEST3 LEFT OUTER JOIN TEST4 on TEST3.ID=TEST4.ID;
The behavior I see here is, it first join TEST1 and TEST2 tables (billions of rows) and then stores the output in temp tablespace. Then it joins TEST3 and TEST4 and then saves the output in same temp table. And finally select the records from there to display the result.
This behavior I see in both Redshift and Oracle. I was just wondering why it stores the result in temporary segments after getting result from first SELECT. It's time taking as well as eats up the temp space. Can not it just starts displaying the result after 1st SELECT is finishes and then goes for 2nd one (instead of storing).
This answer is somewhat speculative, because I don't have an Oracle doc reference. By inspection, we can imagine instead that you wanted to run the following query:
SELECT * FROM TEST1 JOIN TEST2
UNION ALL
SELECT * FROM TEST3 JOIN TEST4
ORDER BY some_col;
It should be clear that to apply any set operation like ORDER BY, all the records returned from the union query would need to be in one logical place. A temp table would seem to work.
That you are not using ORDER BY appears to not affect the workflow which Oracle is using.
I can also add another reason why Oracle is insisting on using a temp table here. Suppose it would be possible to write both halves of the union directly to the buffer. But what would happen if, at a later date, the size of the total union query suddenly exceeded what the buffer can hold? The answer is that your database would crash. So, using a temp table is a safe bet which should generally always work.
How do you observe this behaviour? By any chance don't you perform INSERT or CREATE TABLE? That would explain your observation, because at the end, all rows are required.
Also if your client has set an option fetch all rows this could be observed.
But in normal case, where the client is interested in few first rows Oracle returns quickly the first available (array size) rows from the first join ignoring the second one.
You may perform this little Gedankenexperiment:
create table test1 as
select rownum id,
lpad('x',1023,'X') pad
from dual connect by level <= 1000000;
Create analog the table 2 to 4.
Now run your query (adapted to valid syntax)
SELECT * FROM TEST1 CROSS JOIN TEST2
UNION ALL
SELECT * FROM TEST3 CROSS JOIN TEST4;
This returns for my the first page in SQL Developer in ca 30 seconds, which somehow disproves your claim.
Simple calculate the required TEMP space for two 10**6 * 10**6 cartesian join with row lenth 1K - this is far above my TEMP configuration.
The one possible way to observe what is Oracle actualy doing is to run the query with the /*+ gather_plan_statistics */ hint.
Than get the SQL_ID of the statement and check the actual rows A-Rowsin the plan
select * from table(dbms_xplan.display_cursor('a9y62gxagups6',null,'ALLSTATS LAST'));
SQL_ID a9y62gxagups6, child number 0
-------------------------------------
SELECT /*+ gather_plan_statistics */ * FROM TEST1 CROSS JOIN TEST2
UNION ALL SELECT * FROM TEST3 CROSS JOIN TEST4
Plan hash value: 1763392637
--------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 1 | UNION-ALL | | 1 | | 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 2 | MERGE JOIN CARTESIAN| | 1 | 1000G| 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 3 | TABLE ACCESS FULL | TEST1 | 1 | 1000K| 1 |00:00:00.02 | 4 | 28 | 0 | | | |
| 4 | BUFFER SORT | | 1 | 1000K| 50 |00:00:28.49 | 166K| 166K| 142K| 1255M| 11M| 97M (0)|
| 5 | TABLE ACCESS FULL | TEST2 | 1 | 1000K| 1000K|00:00:03.66 | 166K| 166K| 0 | | | |
| 6 | MERGE JOIN CARTESIAN| | 0 | 1000G| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 7 | TABLE ACCESS FULL | TEST3 | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 8 | BUFFER SORT | | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | 1103M| 10M| |
| 9 | TABLE ACCESS FULL | TEST4 | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
--------------------------------------------------------------------------------------------------------------------------------------
You see, that Oracle
1) full scanned the table2 (row 5)
2) get one row from table1 (row 3)
3) to return to frist 50 rows (row 0)
4) tables 3 and 4 are untached (rows 7 and 9)
You may simple adapt the example to you inner join to see similar results.

Convert raw query into laravel eloquent

I have this written and working as a raw SQL query, but I am trying to convert it to a more Laravel eloquent / query builder design instead of just a raw query.
My table structure like this:
Table One (Name model)
______________
| id | name |
|------------|
| 1 | bob |
| 2 | jane |
--------------
Table Two (Date Model)
_________________________________
| id | table_1_id | date |
|-------------------------------|
| 1 | 1 | 2000-01-01 |
| 2 | 1 | 2000-01-31 |
| 4 | 1 | 2000-02-28 |
| 5 | 1 | 2000-03-03 |
| 6 | 2 | 2000-01-03 |
| 7 | 2 | 2000-01-05 |
---------------------------------
I am returning only the the highest (most recent) dates from table 2 (Dates model) that match the user bob from table 1 (Name model).
For instance, in the example above, I return this from my query
2000-01-31
2000-02-28
2000-03-03
Here is what I am doing now (which works), but i'm just not sure how to use YEAR, MONTH and MAX with laravel.
DB::select(
DB::raw("
SELECT MAX(date) as max_date
FROM table_2
INNER JOIN table_1 ON table_1.id = table_2.table_1_id
WHERE table_1.name = 'bob'
GROUP BY YEAR(date), MONTH(date)
ORDER BY max_date DESC
")
);
Try this code if any problem then,
DB::table('table_1')->join('table_2', 'table_1.id','=','table_2.table_1_id')
->select(DB::raw('MAX(date) as max_date'),DB::raw('YEAR(date) year, MONTH(date) month'),'table_1.name')
->where('name','bob')
->groupBy('year','month')
->orderBy('max_date')
->get();
If any problem with above code then feel free to ask.

Improving SQL Exists scalability

Say we have two tables, TEST and TEST_CHILDS in the following way:
creat TABLE TEST(id1 number PRIMARY KEY, word VARCHAR(50),numero number);
creat TABLE TEST_CHILD (id2 number references test(id), word2 VARCHAR(50));
CREATE INDEX TEST_IDX ON TEST_CHILD(word2);
CREATE INDEX TEST_JOIN_IDX ON TEST_CHILD(id);
insert into TEST SELECT ROWNUM,U1.USERNAME||U2.TABLE_NAME, LENGTH(U1.USERNAME) FROM ALL_USERS U1,ALL_TABLES U2;
INSERT INTO TEST_CHILD SELECT MOD(ROWNUM,15000)+1,U1.USER_ID||U2.TABLE_NAME FROM ALL_USERS U1,ALL_TABLES U2;
We would like to query to get rows from TEST table that satisfy some criteria in the child table, so we go for:
SELECT /*+FIRST_ROWS(10)*/* FROM TEST T WHERE EXISTS (SELECT NULL FROM TEST_CHILD TC WHERE word2 like 'string%' AND TC.id = T.id ) AND ROWNUM < 10;
We always want just the first 10 results, not any more at all. Therefore, we would like to get the same response time to read 10 results whether table has 10 matching values or 1,000,000; since it could get 10 distinct results from the child table and get the values on the parent table (or at least that is the plan that we would like). But when checking the actual execution plan we see:
-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 54 | 5 (20)| 00:00:01 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | NESTED LOOPS | | | | | |
| 3 | NESTED LOOPS | | 1 | 54 | 5 (20)| 00:00:01 |
| 4 | SORT UNIQUE | | 1 | 23 | 3 (0)| 00:00:01 |
| 5 | TABLE ACCESS BY INDEX ROWID| TEST_CHILD | 1 | 23 | 3 (0)| 00:00:01 |
|* 6 | INDEX RANGE SCAN | TEST_IDX | 1 | | 2 (0)| 00:00:01 |
|* 7 | INDEX UNIQUE SCAN | SYS_C005145 | 1 | | 0 (0)| 00:00:01 |
| 8 | TABLE ACCESS BY INDEX ROWID | TEST | 1 | 31 | 1 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<10)
6 - access("WORD2" LIKE 'string%')
filter("WORD2" LIKE 'string%')
7 - access("TC"."ID"="T"."ID")
SORT UNIQUE under the STOPKEY, what afaik means that it is reading all results from the child table, making the distinct to finally select only the first 10, making the query not as scalable as we would like it to be.
Is there any mistake in my example?
Is it possible to improve this execution plan so it scales better?
The SORT UNIQUE is going to find and sort all of the records from TEST_CHILD that matched 'string%' - it is NOT going to read all results from child table. Your logic requires this. IF you only picked the first 10 rows from TEST_CHILD that matched 'string%', and those 10 rows all had the same ID, then your final results from TEST would only have 1 row.
Anyway, your performance should be fine as long as 'string%' matches a relatively low number of rows in TEST_CHILD. IF your situation is such that 'string%' often matches a HUGE record count on TEST_CHILD, there's not much you can do to make the SQL more performant given the current tables. In such a case, if this is a mission-critical SQL, with performance tied to your annual bonus, there's probably some fancy footwork you could do with MATERIALIZED VIEWs to, e.g. pre-compute 10 TEST rows for high-cardinality WORD2 values in TEST_CHILD.
One final thought - a "risky" solution, but one which should work if you don't have thousands of TEST_CHILD rows matching the same TEST row, would be the following:
SELECT *
FROM TEST
WHERE ID1 IN
(SELECT ID2
FROM TEST_CHILD
WHERE word2 like 'string%'
AND ROWNUM < 1000)
AND ROWNUM <10;
You can adjust 1000 up or down, of course, but if it's too low, you risk finding less than 10 distinct ID values, which would give you final results with less than 10 rows.

Joining tables with same column names - ORACLE

I am using Oracle.
I am currently working one 2 tables which both have the same column names. Is there any way in which I can combine the 2 tables together as they are?
Simple example to show what I mean:
TABLE 1:
| COLUMN 1 | COLUMN 2 | COLUMN 3 |
----------------------------------------
| a | 1 | w |
| b | 2 | x |
TABLE 2:
| COLUMN 1 | COLUMN 2 | COLUMN 3 |
----------------------------------------
| c | 3 | y |
| d | 4 | z |
RESULT THAT I WANT:
| COLUMN 1 | COLUMN 2 | COLUMN 3 |
----------------------------------------
| a | 1 | w |
| b | 2 | x |
| c | 3 | y |
| d | 4 | z |
Any help would be greatly appreciated. Thank you in advance!
You can use the union set operator to get the result of two queries as a single result set:
select column1, column2, column3
from table1
union all
select column1, column2, column3
from table2
union on its own implicitly removes duplicates; union all preserves them. More info here.
The column names don't need to be the same, you just need the same number of columns with the same datatpes, in the same order.
(This is not what is usually meant by a join, so the title of your question is a bit misleading; I'm basing this on the example data and output you showed.)

LISTAGG function with two columns

I have one table like this (report)
--------------------------------------------------
| user_id | Department | Position | Record_id |
--------------------------------------------------
| 1 | Science | Professor | 1001 |
| 1 | Maths | | 1002 |
| 1 | History | Teacher | 1003 |
| 2 | Science | Professor | 1004 |
| 2 | Chemistry | Assistant | 1005 |
--------------------------------------------------
I'd like to have the following result
---------------------------------------------------------
| user_id | Department+Position |
---------------------------------------------------------
| 1 | Science,Professor;Maths, ; History,Teacher |
| 2 | Science, Professor; Chemistry, Assistant |
---------------------------------------------------------
That means I need to preserve the empty space as ' ' as you can see in the result table.
Now I know how to use LISTAGG function but only for one column. However, I can't exactly figure out how can I do for two columns at the sametime. Here is my query:
SELECT user_id, LISTAGG(department, ';') WITHIN GROUP (ORDER BY record_id)
FROM report
Thanks in advance :-)
It just requires judicious use of concatenation within the aggregation:
select user_id
, listagg(department || ',' || coalesce(position, ' '), '; ')
within group ( order by record_id )
from report
group by user_id
i.e. aggregate the concatentation of department with a comma and position and replace position with a space if it is NULL.

Resources