Does MonetDB support parallel query execution when using merge tables? - parallel-processing

I have partitioned my data set into two separate sets of 5M rows. Each partition is loaded into a table on a machine of its own. I use a central MonetDB instance where I register both tables as remote tables and add them to a merge table.
When I run a query on the merge table I would expect MonetDB to distribute the query, in parallel, to both partition tables. However, when looking at results created with tomograph I see that each remote table is queried sequentially.
I've compiled MonetDB myself using a recent source tarball. I've disabled geom and made sure embedded Python was available. Other than that I've not changed any settings or configure flags. The two machines holding the partitions are 1-core VMs with 4 GB of memory. The central machine is my laptop, which has 4 cores and 16 GB of memory. I have also run this experiment using a central node with the same configuration as the partitions.
I created the tables like this:
-- On each partition (X = {1, 2}):
CREATE TABLE responses_pX (
r_id int primary key,
r_date date,
r_status tinyint,
age tinyint,
movie varchar(25),
score tinyint
);
-- On central node:
CREATE MERGE TABLE responses (
r_id int primary key,
r_date date,
r_status tinyint,
age tinyint,
movie varchar(25),
score tinyint
);
-- For both partitions
CREATE REMOTE TABLE responses_pX (
r_id int primary key,
r_date date,
r_status tinyint,
age tinyint,
movie varchar(25),
score tinyint
) ON 'mapi:monetdb://partitionX:50000/partitionX';
ALTER TABLE responses ADD TABLE responses_pX;
I'm running the following queries on the central node:
SELECT COUNT(*) FROM responses;
SELECT COUNT(*), SUM(score) FROM responses;
SELECT r_date, age, SUM(score)/COUNT(score) as avg_score FROM responses GROUP BY r_date, age;
For all queries the parallelism reported by the tomograph tool is no higher than 2.11%.

Yes, MonetDB uses parallel processing where possible. See the documentation:
https://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/DistributedQueryProcessing
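To see where the time actually goes, you can prefix the query with EXPLAIN (to inspect the generated MAL plan) or run it under TRACE; if the per-instruction timings show the calls to the two remote tables running back-to-back rather than overlapping, the remote parts are indeed executed sequentially. A sketch, run from mclient on the central node:

```sql
-- Inspect the MAL plan generated for the merge-table query
EXPLAIN SELECT COUNT(*) FROM responses;

-- Profile an actual run; the per-instruction timings show whether
-- the two remote-table calls overlap or run one after the other
TRACE SELECT COUNT(*) FROM responses;
```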

Related

Postgres primary key 'less than' operation is slow

Consider the following table
CREATE TABLE COMPANY(
ID BIGINT PRIMARY KEY NOT NULL,
NAME TEXT NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR(50),
SALARY REAL
);
Suppose this table holds 100 million rows of random data.
Select age from company where id=2855265
executes in less than a millisecond.
Select age from company where id<353
returns fewer than 50 rows and also executes in less than a millisecond.
Both queries use the index.
But the following query uses a full table scan and takes 3 seconds:
Select age from company where id<2855265
It returns fewer than 500 rows.
How can I speed up a query that selects rows with a primary key less than some value?
Performance
The predicate id < 2855265 potentially returns a large percentage of rows in the table. Unless Postgres has information in table statistics to expect only around 500 rows, it might switch from an index scan to a bitmap index scan or even a sequential scan. Explanation:
Postgres not using index when index scan is much better option
We would need to see the output from EXPLAIN (ANALYZE, BUFFERS) for your queries.
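For reference, capturing the plan looks like this (the interesting parts of the output are whether it reports an Index Scan, Bitmap Heap Scan, or Seq Scan, and the shared-buffer counts):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT age FROM company WHERE id < 2855265;
```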
When you repeat the query, do you get the same performance? There may be caching effects.
Either way, 3 seconds is way too slow for 500 rows. Postgres might be working with outdated or inexact table statistics. Or there may be issues with your server configuration (not enough resources). Or there can be several other less common reasons, including hardware issues ...
If VACUUM ANALYZE did not help, VACUUM FULL ANALYZE might. It effectively rewrites the whole table and all indexes in pristine condition. Takes an exclusive lock on the table and might conflict with concurrent access!
I would also consider increasing the statistics target for the id column. Instructions:
Keep PostgreSQL from sometimes choosing a bad query plan
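As a sketch (1000 is an arbitrary value; the default statistics target is 100):

```sql
-- Collect more detailed statistics for the id column, then re-analyze
ALTER TABLE company ALTER COLUMN id SET STATISTICS 1000;
ANALYZE company;
```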
Table definition?
Whatever else you do, there seem to be various problems with your table definition:
CREATE TABLE COMPANY(
ID BIGINT PRIMARY KEY NOT NULL, -- int is probably enough. "id" is a terrible column name
NAME TEXT NOT NULL, -- "name" is a terrible column name
AGE INT NOT NULL, -- typically bad idea to store age, store birthday instead
ADDRESS CHAR(50), -- never use char(n)!
SALARY REAL -- why would a company have a salary? never store money as real
);
You probably want something like this instead:
CREATE TABLE employee(
employee_id serial PRIMARY KEY
, company_id int NOT NULL -- REFERENCES company(company_id)?
, birthday date NOT NULL
, employee_name text NOT NULL
, address varchar(50) -- or just text
, salary int -- store amount as *Cents*
);
Related:
How to implement a many-to-many relationship in PostgreSQL?
Any downsides of using data type "text" for storing strings?
You will need to run VACUUM ANALYZE company; to update the planner's statistics.

How to store range/hash composite partitions in separate datafiles by range?

I'm creating a database which will utilize composite partitioning. I will partition one table using range partitioning (by date)
and then further subpartition it by hash (by client id). So far so good, no problem, but I also need to have those partitions
stored in separate data files each dbf holding data for a single month. I'm reading on composite partitions and what I found
is that primary range partitioning will be only a logical one and data will be stored in subpartitions instead which seems to
make my goal impossible. Am I right, and should I look for a different solution?
Thanks in advance.
My databases are Oracle 11g and Oracle 12
On existing table you can move partitions or subpartitions to a different tablespace, i.e. different datafile, examples:
ALTER TABLE scuba_gear MOVE SUBPARTITION bcd_types TABLESPACE tbs23;
ALTER TABLE parts MOVE PARTITION depot2 TABLESPACE ts094;
see Moving Subpartitions and Moving Table Partitions
For new tables, you would typically create them like this:
CREATE TABLE sales
( prod_id NUMBER(6)
, cust_id NUMBER
, time_id DATE
, channel_id CHAR(1)
, promo_id NUMBER(6)
, quantity_sold NUMBER(3)
, amount_sold NUMBER(10,2)
)
PARTITION BY RANGE (time_id)
INTERVAL (NUMTOYMINTERVAL(1,'MONTH'))
STORE IN (ts_1, ts_2, ts_3, ts_4, ts_5, ts_6 ,ts_7 ,ts_8, ts_9, ts_10, ts_11, ts_12)
SUBPARTITION BY HASH (cust_id) SUBPARTITIONS 4
( PARTITION before_2000 VALUES LESS THAN (TO_DATE('01-JAN-2000','dd-MON-yyyy')));
Oracle will then assign the monthly partitions to these 12 tablespaces in round-robin fashion. The STORE IN clause is also possible for subpartitions; see Creating a composite range-hash partitioned table
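For example, the subpartition-level variant looks roughly like this (the table and the tablespace names ts_a .. ts_d are made up for illustration):

```sql
-- Same idea as the sales table above, but spreading the four hash
-- subpartitions of each monthly partition over four tablespaces
CREATE TABLE sales2
( prod_id  NUMBER(6)
, cust_id  NUMBER
, time_id  DATE
)
PARTITION BY RANGE (time_id)
INTERVAL (NUMTOYMINTERVAL(1,'MONTH'))
SUBPARTITION BY HASH (cust_id)
  SUBPARTITIONS 4 STORE IN (ts_a, ts_b, ts_c, ts_d)
( PARTITION before_2000 VALUES LESS THAN (TO_DATE('01-JAN-2000','dd-MON-yyyy')));
```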

Oracle subpartition by reference: not supported, can it be emulated?

Given an address table with millions of entries, partitioned by state_cd:
CREATE TABLE ADDRESSES
(
ADDRESS_ID INT NOT NULL PRIMARY KEY,
STATE_CD VARCHAR2(2) NOT NULL,
CITY VARCHAR2(100) NOT NULL
)
PARTITION BY LIST(STATE_CD)
(PARTITION P_AL VALUES ('AL'),...., PARTITION P_WY VALUES ('WY'));
And a measurements table with millions of measurements per day, range-partitioned by date:
CREATE TABLE MEASUREMENTS
(
MEASUREMENT_ID INT NOT NULL PRIMARY KEY,
MEASUREMENT_DT DATE NOT NULL,
ADDRESS_ID INT NOT NULL,
MEASUREMENT NUMBER NOT NULL,
CONSTRAINT MEASUREMENTS_FK1 FOREIGN KEY (ADDRESS_ID)
REFERENCES ADDRESSES (ADDRESS_ID)
)
PARTITION BY RANGE (MEASUREMENT_DT)
(PARTITION P_20150101 VALUES LESS THAN (DATE '2015-01-02'),...);
Many queries would be greatly improved with partition-wise joining of MEASUREMENTS and ADDRESSES, e.g.:
SELECT TRUNC(MEASUREMENT_DT) MNTH,
CITY,
AVG(MEASUREMENT)
FROM MEASUREMENTS
JOIN ADDRESSES USING (ADDRESS_ID)
GROUP BY TRUNC(MEASUREMENT_DT), CITY;
However, adding STATE_CD to MEASUREMENTS is an unacceptable violation of normal form (introducing entirely new performance issues, e.g. ADDRESSES JOIN MEASUREMENTS USING (ADDRESS_ID, STATE_CD) really messes with the CBO's cardinality estimates).
Optimal solution would be sub-partitioning by reference, which Oracle does not support: CREATE TABLE MEASUREMENTS ... PARTITION BY RANGE(MEASUREMENT_DT) SUBPARTITION BY REFERENCE (MEASUREMENTS_FK1).
This seems like it'd be a fairly straightforward application of reference partitioning. But not only is the syntax not supported; I wasn't able to find a lot of forum activity clamoring for such a feature. This led to my posting here: typically when I am looking for a feature that no one else is looking for, it means I've overlooked an equivalent (or potentially even better) alternative.

Best way to design for one-to-many code values in Oracle

We receive several millions of records per day on temperature metrics. Most of the pertinent metadata for these records is maintained in a single partitioned table by date (month). We are going to start receiving up to 20 individual codes associated with this data. The intent is to ultimately allow searching by these codes.
What would be the most effective way to store this data to minimize search response time? The database is Oracle 11gR2.
Some options I was taking into consideration:
(1) Create a separate table with main record id and code values. Something like
id code
-- ----
1 AA
1 BB
1 CC
2 AA
2 CC
2 DD
Concerns:
would probably require a bitmap index on the code column, but the table is highly transactional so no bitmap
table would get huge over time with up to 20 codes per main record
(2) Create a separate table partitioned by code values.
Concerns:
partition maintenance for new codes
search performance
(3) Add a XMLType column to existing table and format the codes for each record into XML and create an XMLIndex on the column: Something like:
<C1>AA</C1>
<C2>BB</C2>
<C3>CC</C3>
Concerns:
query response time when searching on CODE probably would be poor
Any recommendations are welcome.
Thanks.
You need to benchmark different approaches. There's no way we can give you meaningful solutions without knowing much more about your scenario. How many different codes will there be in total? What's the average number of codes per reading? Will there be a noticeable skew in the distribution of codes? What access paths do you need to support for searching by code?
Then there's the matter of how you load data (batches? drip feed?). And what benefits you derive from using partitioning.
Anyway. Here's one more approach, which is an amalgamation of your (1) and (2).
Given that your main table is partitioned by month you should probably partition any child table with the same scheme. You can subpartition by code as well.
create table main_codes
( reading_dt date not null
, main_id number not null
, code varchar2(2)
, constraint main_codes_pk primary key (code, reading_dt, main_id) using index local
)
partition by range (reading_dt)
subpartition by list (code)
subpartition template
(
subpartition sp_aa values ( 'AA' ),
subpartition sp_bb values ( 'BB' ),
subpartition sp_cc values ( 'CC' ),
subpartition sp_dd values ( 'DD' )
)
(
partition p_2015JAN values less than (date '2015-02-01' ),
partition p_2015FEB values less than (date '2015-03-01' ),
partition p_2015MAR values less than (date '2015-04-01' )
)
/
You'll probably want a foreign key on the main table too:
alter table main_codes
add constraint code_main_fk foreign key (reading_dt, main_id)
references main_table (reading_dt, main_id)
/
create index code_main_idx on main_codes (reading_dt, main_id) local
/
Depending on the number of codes you have, creating the subpartition template could be tedious. This is why Nature gave us cut'n'paste.
But whatever you do, don't go down the XML path.
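With that layout, a search by code within a month can prune down to a single subpartition, e.g. (code and date range made up):

```sql
-- Touches only subpartition sp_aa of partition p_2015FEB
SELECT main_id
FROM   main_codes
WHERE  code = 'AA'
AND    reading_dt >= DATE '2015-02-01'
AND    reading_dt <  DATE '2015-03-01';
```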

Oracle Partition by 2 columns Interval

I've tried to find my situation on Google, but no one talks about it.
I have a table that is going to be partitioned on 2 columns.
Can anyone show an example of an interval for a 2-column partition?
In this case I have only one.
For this example, how do I use an interval with 2 columns?
INTERVAL( NUMTODSINTERVAL(1,'DAY'))
My table:
create table TABLE_TEST
(
PROCESS_DATE DATE GENERATED ALWAYS AS (TO_DATE(SUBSTR("CHARGE_DATE_TIME",1,10),'yyyymmdd')),
PROCESS_HOUR VARCHAR(10) GENERATED ALWAYS AS (SUBSTR("CHARGE_DATE_TIME",12,2)),
ANUM varchar(100),
SWTICH_DATE_TIME varchar(100),
CHARGE_DATE_TIME varchar(100),
CHARGE varchar(100)
)
TABLESPACE TB_LARGE_TAB
PARTITION BY RANGE (PROCESS_DATE, PROCESS_HOUR)
INTERVAL( NUMTODSINTERVAL(1,'DAY'))
Many Thanks,
Macieira
You can't use an interval if your range has more than one column; you'd get: ORA-14750: Range partitioned table with INTERVAL clause has more than one column. From the documentation:
You can specify only one partitioning key column, and it must be of NUMBER, DATE, FLOAT, or TIMESTAMP data type.
I'm not sure why you're splitting the date and hour out into separate columns (since a date has a time component anyway), or why you're storing the 'real' date and number values as strings; it would be much simpler to just have columns with the correct data types in the first place. But assuming you are set on storing the data that way and need the separate process_date and process_hour columns as you have them, you can add a third virtual column that combines them:
create table TABLE_TEST
(
PROCESS_DATE DATE GENERATED ALWAYS AS (TO_DATE(SUBSTR(CHARGE_DATE_TIME,1,10),'YYYYMMDD')),
PROCESS_HOUR VARCHAR2(8) GENERATED ALWAYS AS (SUBSTR(CHARGE_DATE_TIME,12,2)),
PROCESS_DATE_HOUR DATE GENERATED ALWAYS AS (TO_DATE(CHARGE_DATE_TIME, 'YYYYMMDDHH24')),
ANUM VARCHAR2(100),
SWTICH_DATE_TIME VARCHAR2(100),
CHARGE_DATE_TIME VARCHAR2(100),
CHARGE VARCHAR2(100)
)
PARTITION BY RANGE (PROCESS_DATE_HOUR)
INTERVAL (NUMTODSINTERVAL(1,'DAY'))
(
PARTITION TEST_PART_0 VALUES LESS THAN (DATE '1970-01-01')
);
Table table_test created.
I've also changed your string data types to varchar2 and added a made-up initial partition. process_hour probably wants to be a number type, depending on how you'll use it. As I don't know why you're choosing your current data types it's hard to tell what would really be more appropriate.
I don't really understand why you'd want the partition range to be hourly and the interval to be one day though, unless you want the partitions to be from, say, midday to midday; in which case the initial partition (test_part_0) would have to specify that time, and your range specification is still wrong for that.
Interval partitioning can be built on only one column.
In your case you have a proper partition key column - CHARGE_DATE_TIME. Why do you create the virtual columns as VARCHAR2? And why do you need to build the partition key on them? Interval partitioning can only be built on NUMBER or DATE columns.