I see the following error in vertica.log:
2016-09-01 15:30:54.007 TM Moveout:0x7f9438012440-a00000001212c3 [Txn] <INFO> Begin Txn: a00000001212c3 'Moveout: Tuple Mover'
2016-09-01 15:30:54.007 TM Moveout:0x7f9438012440-a00000001212c3 [TM] <INFO> Tuple Mover: moving out projection rosing_epg_program_events_super
2016-09-01 15:30:54.017 TM Moveout:0x7f9438012440-a00000001212c3 [EE] <INFO> (a00000001212c3) Moveout projection staging.rosing_epg_program_events_super
2016-09-01 15:30:54.017 TM Moveout:0x7f9438012440-a00000001212c3 [EE] <INFO> (a00000001212c3) TM Moveout: moving out data in WOS for proj "staging.rosing_epg_program_events_super" to epoch 3061
2016-09-01 15:30:54.017 TM Moveout:0x7f9438012440-a00000001212c3 [EE] <INFO> (a00000001212c3) Executing the moveout plan
2016-09-01 15:30:54.040 TM Moveout:0x7f9438012440-a00000001212c3 [EE] <INFO> SortManager found maxMerges 7 too small(64 MB Assigned).
2016-09-01 15:30:54.040 TM Moveout:0x7f9438012440-a00000001212c3 [EE] <INFO> After disabling optimization, maxMerges becomes 15.
2016-09-01 15:30:54.069 TM Moveout:0x7f9438012440-a00000001212c3 [Txn] <INFO> Rollback Txn: a00000001212c3 'Moveout: (Table: staging.rosing_epg_program_events) (Projection: staging.rosing_epg_program_events_super)'
2016-09-01 15:30:54.070 TM Moveout:0x7f9438012440 <LOG> #v_statistic_node0001: 00000/3298: Event Posted: Event Code:14 Event Id:261 Event Severity: Warning [4] PostedTimestamp: 2016-09-01 16:30:54.069887 ExpirationTimestamp: 2016-09-01 16:31:09.069887 EventCodeDescription: Timer Service Task Error ProblemDescription: threadShim: Too many data partitions DatabaseName: statistic Hostname: rosing-vertica.elt.stag.local
2016-09-01 15:30:54.070 TM Moveout:0x7f9438012440 <ERROR> #v_statistic_node0001: {threadShim} 54000/5060: Too many data partitions
HINT: Verify that the table partitioning expression is correct
LOCATION: handlePartitionKey, /scratch_a/release/16125/vbuild/vertica/EE/Operators/DataTarget.cpp:1478
2016-09-01 15:30:54.070 TM Moveout:0x7f9438012440 [Util] <INFO> Task 'TM Moveout' enabled
It seems I chose the wrong field for partitioning and hit the limit on the number of partitions in the WOS, as described here.
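A rough way to confirm the partition count is to query Vertica's partitions system table (assuming the standard v_monitor.partitions view; the name filter is just my projection):
SELECT projection_name, COUNT(DISTINCT partition_key) AS partition_count
FROM v_monitor.partitions
WHERE projection_name ILIKE 'rosing_epg_program_events%'
GROUP BY projection_name;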
The task SELECT do_tm_task('moveout'); raises the following error:
Task: moveout
(Table: staging.rosing_schema_migrations) (Projection: staging.rosing_schema_migrations_super)
...
(Table: staging.rosing_epg_program_events) (Projection: staging.rosing_epg_program_events_super)
On node v_statistic_node0001:
ERROR 5060: Too many data partitions
(1 row)
Does anybody know how to fix this problem?
Update:
I can't remove partitioning from this table:
ALTER TABLE rosing_epg_program_events REMOVE PARTITIONING
because this SQL raises the same error: Too many data partitions
Update 2:
I fixed this problem using woot's answer. Thank you a lot!
Here are my steps to fix it:
Create a copy of the rosing_epg_program_events table:
CREATE TABLE staging.rosing_epg_program_events2
LIKE staging.rosing_epg_program_events;
Remove partitioning from the new table:
ALTER TABLE staging.rosing_epg_program_events2 REMOVE PARTITIONING;
Copy the data from the old table to the new one. It seems the old table contains all (!) the data inserted both before and after the problem appeared:
INSERT /*+ DIRECT */ INTO staging.rosing_epg_program_events2
SELECT * FROM staging.rosing_epg_program_events;
Drop the old table:
DROP TABLE staging.rosing_epg_program_events;
Rename the new table:
ALTER TABLE staging.rosing_epg_program_events2 RENAME TO rosing_epg_program_events;
Run the moveout operation, just in case. Now it works fine:
SELECT do_tm_task('moveout');
Check the last good epoch, just in case. Now it shows the current value:
SELECT GET_LAST_GOOD_EPOCH();
SELECT * FROM epochs WHERE epoch_number = 3064; -- result of previous command
It seems everything works fine now.
Do a CREATE TABLE AS SELECT or CREATE TABLE LIKE INCLUDING PROJECTIONS, remove the partitioning, then use INSERT /*+ DIRECT */ ... SELECT to copy the data out; drop the old table and rename the new one. Also, when creating partitions, try to target a granularity that yields somewhere under 40 partitions. You didn't specify, but if you are partitioning on a timestamp, use a formula to extract a less granular value. For example, to partition monthly, do:
EXTRACT (year FROM mydate) * 100 + EXTRACT (month FROM mydate)
You don't have to worry about using formulas in the partitioning for Vertica. It uses min/max values for the fields instead of direct matching on the partition key.
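Applied to a table definition, that expression might look like this (a sketch; the table and column names here are illustrative, not from the question):
CREATE TABLE staging.events_by_month (
    event_id INT,
    mydate TIMESTAMP,
    payload VARCHAR(200)
)
PARTITION BY (EXTRACT(year FROM mydate) * 100 + EXTRACT(month FROM mydate));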
The title pretty much says it. I want to create a materialized view whose SELECT clause reads data from another materialized view in ClickHouse. I have tried this: the SQL for the creation of the two views runs without error, but at runtime the first view is populated while the second one isn't.
I need to know whether I am making a mistake in my SQL or this is simply not possible.
Here are my two views:
CREATE MATERIALIZED VIEW IF NOT EXISTS production_gross
ENGINE = ReplacingMergeTree
ORDER BY (profile_type, reservoir, case_tag, variable_name, profile_phase, well_name, case_name,
timestamp) POPULATE
AS
SELECT profile_type,
reservoir,
case_tag,
is_endorsed,
toDateTime64(endorsement_date / 1000.0, 0) AS endorsement_date,
endorsed_for_month,
variable_name,
profile_phase,
well_name,
case_name,
asset_id,
toDateTime64(eoh / 1000, 0) as end_of_history,
toDateTime64(ts / 1000, 0) as timestamp,
value, -- AS rate, -- cubic meters per second rate for this month
value * dateDiff('second',
toStartOfMonth(subtractMonths(now(), 1)),
toStartOfMonth(now())) AS volume -- cubic meters volume for this month
FROM (
SELECT pp.profile_type AS profile_type,
trimBoth(splitByChar('-', case_name)[1]) AS reservoir,
JSONExtractString(cd.data, 'case_data', 'Tags$$Tag') AS case_tag,
JSONExtractString(cd.data, 'case_data', 'Tags$$Endorsed') AS is_endorsed,
-- Endorsement Data, is the timestamp when the user "endorsed" the case
JSONExtract(cd.data, 'case_data', 'Tags$$EndorsementDate', 'time_stamp', 'Int64') AS endorsement_date,
-- Endorsement Month is the month of year for which the case was actually endorsed
JSONExtractString(cd.data, 'case_data', 'Tags$$MonthTags') AS endorsed_for_month,
pp.variable_name AS variable_name,
JSONExtractString(pp.data, 'profile_phase') AS profile_phase,
JSONExtractString(wd.data, 'name') AS well_name,
JSONExtractString(cd.data, 'header', 'name') AS case_name,
-- We might want to have asset id here to use in roll-up
JSONExtract(cd.data, 'header', 'reservoir_asset_id', 'Int64') AS asset_id, -- Asset Id in ARM
JSONExtract(pp.data, 'end_of_history', 'Int64') AS end_of_history,
JSONExtract(pp.data, 'values', 'Array(Float64)') AS values,
JSONExtract(pp.data, 'timestamps', 'Array(Int64)') AS timestamps,
JSONExtract(pp.data, 'end_of_history', 'Int64') AS eoh
FROM production_profile AS pp
INNER JOIN well_data AS wd ON wd.uuid = pp.well_id
INNER JOIN case_data AS cd ON cd.uuid = pp.case_id
)
ARRAY JOIN
values AS value,
timestamps AS ts
;
CREATE MATERIALIZED VIEW IF NOT EXISTS production_volume_actual
ENGINE = ReplacingMergeTree
ORDER BY (asset_id,
case_tag,
variable_name,
endorsement_date) POPULATE
AS
SELECT profile_type,
case_tag,
is_endorsed,
endorsement_date,
endorsed_for_month,
variable_name,
profile_phase,
asset_id,
sum(volume) AS total_actual_volume
FROM production_gross
WHERE timestamp < end_of_history
GROUP BY profile_type,
case_tag,
is_endorsed,
endorsement_date,
endorsed_for_month,
variable_name,
profile_phase,
asset_id
ORDER BY asset_id ASC,
case_tag ASC,
variable_name ASC,
endorsement_date ASC
;
As you can see, the second view is an "aggregation" on the first, and that is why I need it. If I had to do the aggregation from scratch, a lot of processing would have to be done twice.
Update:
I tried changing the query to the following:
SELECT ...
FROM `.inner.production_gross`
...
That did not help; the query resulted in the following error:
Code: 60. DB::Exception: Table default.`.inner.production_gross` doesn't exist.
Then, based on the comment by @DennyCrane and using this answer: https://stackoverflow.com/a/67709334/959156, I ran this query:
SELECT
uuid,
name
FROM system.tables
WHERE database = 'default' AND engine = 'MaterializedView'
Which gave me the uuid of the inner table:
ebab2dc5-2887-4e7d-998d-6acaff122fc7
So, I ran this query:
SELECT ...
FROM `.inner.ebab2dc5-2887-4e7d-998d-6acaff122fc7`
Which resulted in the following error:
Code: 60. DB::Exception: Table default.`.inner.ebab2dc5-2887-4e7d-998d-6acaff122fc7` doesn't exist.
Materialized views work as insert triggers on actual data tables, so your production_volume_actual table has to do a SELECT on a data table, not a "view".
If you CREATE a materialized view using an ENGINE (and not as TO another data table), ClickHouse actually creates a hidden data table named .inner.<mv_name> on older versions (databases not using the Atomic engine), or .inner_id.<some UUID> when using an Atomic or Replicated database engine. So change the SELECT in your second view to that "inner" table name, either:
select from `.inner.production_gross`
select from `.inner_id.<UUID>` -- note the extra '_id' on 'inner'
It should work.
This answer can point you to the right UUID.
At ClickHouse we actually recommend that you always create materialized views as TO <second_table>, to avoid this kind of confusion and to make operations on <second_table> simpler and more transparent.
(Thanks to OP Mostafa Zeinali and Denny Crane for the clarification for more recent ClickHouse versions)
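For illustration, a hedged sketch of that TO pattern applied to a two-level cascade like yours; the table names, the flattened source table production_profile_raw, and the trimmed-down column set are illustrative, not the OP's actual schema:
-- Explicit storage for the first aggregation level.
CREATE TABLE production_gross_data
(
    asset_id Int64,
    ts       DateTime,
    volume   Float64
)
ENGINE = MergeTree
ORDER BY (asset_id, ts);

-- The first MV writes into its target table on every insert into the source.
CREATE MATERIALIZED VIEW production_gross_mv
TO production_gross_data
AS
SELECT asset_id, ts, value AS volume
FROM production_profile_raw;  -- hypothetical flat source table

-- Second level: another explicit target table...
CREATE TABLE production_volume_actual_data
(
    asset_id            Int64,
    total_actual_volume Float64
)
ENGINE = SummingMergeTree
ORDER BY asset_id;

-- ...and an MV that reads FROM the first *data* table, so the insert
-- trigger chain works: production_profile_raw -> gross_data -> volume_data.
CREATE MATERIALIZED VIEW production_volume_actual_mv
TO production_volume_actual_data
AS
SELECT asset_id, sum(volume) AS total_actual_volume
FROM production_gross_data
GROUP BY asset_id;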
I have two tables in Oracle, and I have to synchronize values (the Field column) between them. I'm using Informatica PowerCenter for this synchronization. The source qualifier query causes high I/O usage, and I need to solve that.
Table1
Table1 has about 20M rows. Field in Table1 holds the authoritative value. The Timestamp column holds the create/update date, and the table has daily partitions.
Id  Field  Timestamp
1   A      2017-05-12 03:13:40
2   B      2002-11-01 07:30:46
3   C      2008-03-03 03:26:29
Table2
Table2 has about 500M rows. Field in Table2 should be kept as closely in sync as possible with Field in Table1. The Timestamp column holds the create/update date, and the table has daily partitions. Table2 is also the target in the mapping.
Id   Table1_Id  Field  Timestamp            Action
100  1          A      2005-09-30 03:20:41  Nothing
101  1          B      2015-06-29 09:41:44  Update Field as A
102  1          C      2016-01-10 23:35:49  Update Field as A
103  2          A      2019-05-08 07:42:46  Update Field as B
104  2          B      2003-06-02 11:23:57  Nothing
105  2          C      2021-09-21 12:04:24  Update Field as B
106  3          A      2022-01-23 01:17:18  Update Field as C
107  3          B      2008-04-24 15:17:25  Update Field as C
108  3          C      2010-01-15 07:20:13  Nothing
Mapping Queries
Source Qualifier Query
SELECT *
FROM Table1 t1, Table2 t2
WHERE t1.Id = t2.Table1_Id AND t1.Field <> t2.Field
Update Transformation Query
UPDATE Table2
SET
Field = :tu.Field,
Timestamp = SYSDATE
WHERE Id = :tu.Id
You can use the approach below.
SQ: your SQL is correct and you can keep it if you see it working, but add a condition on the partition-date key column. Alternatively, you can use this SQL to speed it up:
SELECT *
FROM Table2 t2
INNER JOIN Table1 t3 ON t3.Id = t2.Table1_Id
LEFT OUTER JOIN Table1 t1 ON t1.Id = t2.Table1_Id AND t1.Field = t2.Field AND t1.partition_date = t2.partition_date -- you did not mention a partition_date column, but I am assuming there is a separate column used for partitioning
WHERE t1.id is null -- <> is inefficient.
Then, in your Informatica target T2 definition, make sure you mark partition_date as part of the key along with Id.
Then use an Update Strategy set to DD_UPDATE. You can set the session to update as well.
And remove that target override. It applies the update query to the whole table and can sometimes be inefficient and I/O intensive.
Informatica is good at updating data in batches through the Update Strategy; you can increase the commit interval to suit your performance needs.
You shouldn't try to update a 500M-row table in a single pass using SQL. You can, however, use PL/SQL to update in batches.
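A hedged PL/SQL sketch of such a batched update (the cursor shape and the 50,000-row batch size are assumptions; table and column names follow the question):
DECLARE
  CURSOR c_diff IS
    SELECT t2.Id, t1.Field
    FROM Table2 t2
    JOIN Table1 t1 ON t1.Id = t2.Table1_Id
    WHERE t1.Field <> t2.Field;
  TYPE t_id_tab IS TABLE OF Table2.Id%TYPE;
  TYPE t_field_tab IS TABLE OF Table1.Field%TYPE;
  l_ids    t_id_tab;
  l_fields t_field_tab;
BEGIN
  OPEN c_diff;
  LOOP
    -- Pull one batch of mismatched rows.
    FETCH c_diff BULK COLLECT INTO l_ids, l_fields LIMIT 50000;
    EXIT WHEN l_ids.COUNT = 0;
    -- Apply the whole batch in one round trip.
    FORALL i IN 1 .. l_ids.COUNT
      UPDATE Table2
      SET Field = l_fields(i),
          Timestamp = SYSDATE
      WHERE Id = l_ids(i);
    COMMIT; -- bound undo/redo per batch
  END LOOP;
  CLOSE c_diff;
END;
/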
I have a table with 17 billion rows. I want to delete some of those, present in another table.
I tried a parallelized delete statement, which did not complete because there wasn't enough temp space:
delete /*+ PARALLEL(a, 32) */
from a
where (a.key1, a.key2) in
(select /*+ PARALLEL(b, 16) */
key1,
key2
from b);
Then I tried a create table as select, which also failed for the same reason.
create table a_temp parallel 32 nologging as
select /*+ PARALLEL(a, 32) */
key1,
key2,
rest_of_data
from a
where (a.key1, a.key2) not in
(select key1, key2 from b);
A regular (non-PARALLEL) delete was taking more than one day, so I had to terminate it.
Is there a way to free temp space that is no longer needed while the delete is executing?
Is there another way I could do this?
EDIT:
B has 173 million records, and almost 16 billion records have to be deleted (almost the whole table). There are no indexes on the table.
EDIT 2:
The explain plan for the create table is as follows:
CREATE TABLE STATEMENT, GOAL = ALL_ROWS 6749420 177523935 10828960035
PX COORDINATOR
PX SEND QC (RANDOM) SYS :TQ10001 6740915 177523935 10828960035
LOAD AS SELECT (HYBRID TSM/HWMB) USER A_TEMP
OPTIMIZER STATISTICS GATHERING 6740915 177523935 10828960035
MERGE JOIN ANTI NA 6740915 177523935 10828960035
SORT JOIN 6700114 17752393472 745600525824
PX BLOCK ITERATOR 45592 17752393472 745600525824
TABLE ACCESS FULL USER A 45592 17752393472 745600525824
SORT UNIQUE 40802 173584361 3298102859
PX RECEIVE 5365 173584361 3298102859
PX SEND BROADCAST SYS :TQ10000 5365 173584361 3298102859
PX BLOCK ITERATOR 5365 173584361 3298102859
TABLE ACCESS FULL USER B 5365 173584361 3298102859
Thanks in advance
I made it work, using a different solution.
I created the a_temp table manually and did an insert with an APPEND PARALLEL hint. The temp space wasn't exceeded and the inserts performed perfectly.
Here is the code:
create table a_temp(..);
insert /*+ APPEND PARALLEL(a_temp, 32) */
into a_temp(...)
select /*+ PARALLEL(a, 32) */
(...)
from a
where not exists
(select /*+ PARALLEL(b, 16) */
'1'
from b
where a.key1 = b.key1
and a.key2 = b.key2);
To solve this issue in the past, I have deleted in batches of ~1M rows at a time. After a lot of digging for a cleaner solution, a DBA insisted that I take this approach.
This was my workflow:
I used Python and the cx_Oracle module to read in the PK values for the to-be-deleted records, iteratively plugged them into an executemany call as bind variables, and committed after every iteration.
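In PL/SQL terms, that batched approach looks roughly like this (a sketch; I drove it from Python with cx_Oracle, and the 1M batch size is illustrative):
BEGIN
  LOOP
    DELETE FROM a
    WHERE (a.key1, a.key2) IN (SELECT key1, key2 FROM b)
      AND ROWNUM <= 1000000;  -- ~1M rows per batch
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;  -- release undo space between batches
  END LOOP;
  COMMIT;
END;
/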
If you want to stick with a parallel execution approach:
Remember to use ALTER SESSION ENABLE PARALLEL DML so that your merge or delete is executed in parallel too. Check out this great blog post that walks you through it:
https://dioncho.wordpress.com/2010/12/10/interpreting-parallel-merge-statement/
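For example (a minimal sketch reusing the delete from the question):
ALTER SESSION ENABLE PARALLEL DML;

DELETE /*+ PARALLEL(a, 32) */ FROM a
WHERE (a.key1, a.key2) IN (SELECT key1, key2 FROM b);

COMMIT;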
I'm using DataStax Community v2.1.2-1 (AMI v2.5) with the preinstalled default settings, plus the read timeout increased to 10 seconds. Here is the issue:
create table simplenotification_ttl (
user_id varchar,
real_time timestamp,
insert_time timeuuid,
read boolean,
msg varchar, PRIMARY KEY (user_id, real_time, insert_time));
Insert Query:
insert into simplenotification_ttl (user_id, real_time, insert_time, read)
values ('test_3',14401440123, now(),false) using TTL 800;
For the same 'test_3' I inserted 33,000 tuples. [This problem does not happen with 24,000 tuples.]
Gradually I see:
cqlsh:notificationstore> select count(*) from simplenotification_ttl where user_id = 'test_3';
count
-------
15681
(1 rows)
cqlsh:notificationstore> select count(*) from simplenotification_ttl where user_id = 'test_3';
count
-------
12737
(1 rows)
cqlsh:notificationstore> select count(*) from simplenotification_ttl where user_id = 'test_3';
errors={}, last_host=127.0.0.1
I have experimented with this many times, even on different tables. Once this happens, even if I insert with the same user_id and do a retrieval with LIMIT 1, it times out.
I need TTL to work properly, i.e. give a count of 0 after the stipulated time. How can I solve this issue?
Thanks
[My other node-related setup: two m3.large nodes with EC2Snitch.]
You're running into a problem where the number of tombstones (deleted values) is passing a threshold, and then timing out.
You can see this if you turn on tracing and then try your select statement, for example:
cqlsh> tracing on;
cqlsh> select count(*) from test.simple;
activity | timestamp | source | source_elapsed
---------------------------------------------------------------------------------+--------------+--------------+----------------
...snip...
Scanned over 100000 tombstones; query aborted (see tombstone_failure_threshold) | 23:36:59,324 | 172.31.0.85 | 123932
Scanned 1 rows and matched 1 | 23:36:59,325 | 172.31.0.85 | 124575
Timed out; received 0 of 1 responses for range 2 of 4 | 23:37:09,200 | 172.31.13.33 | 10002216
You're kind of running into an anti-pattern for Cassandra where data is stored for just a short time before being deleted. There are a few options for handling this better, including revisiting your data model if needed. Here are some resources:
The cassandra.yaml configuration file - See section on tombstone settings
Cassandra anti-patterns: Queues and queue-like datasets
About deletes
For your sample problem, I tried lowering the gc_grace_seconds setting to 300 (5 minutes). That causes tombstones to be cleaned up much more frequently than the default 10 days, but that may or may not be appropriate for your application. Read up on the implications of deletes and adjust as needed.
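That change is a single CQL statement (using the keyspace and table from the question; 300 seconds is only safe if repairs run more often than that window):
ALTER TABLE notificationstore.simplenotification_ttl
  WITH gc_grace_seconds = 300;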
On a new job I have to figure out how some database reporting scripts work.
There is one table that is giving me some trouble. I see in the existing scripts that it is a partitioned table.
My problem is that whatever query I run against this table returns "no rows selected".
Here are some details from my investigation into this table:
Table size estimate
SQL> select sum(bytes)/1024/1024 Megabytes from dba_segments where segment_name = 'PPREC';
MEGABYTES
----------
45.625
Partitions
There are a total of 730 partitions over a date range.
SQL> select min(PARTITION_NAME),max(PARTITION_NAME) from dba_segments where segment_name = 'PPREC';
MIN(PARTITION_NAME) MAX(PARTITION_NAME)
------------------------------ ------------------------------
PART20110201 PART20130130
There are several tablespaces, and the partitions are allocated across them:
SQL> select tablespace_name, count(partition_name) from dba_segments where segment_name = 'PPREC' group by tablespace_name;
TABLESPACE_NAME COUNT(PARTITION_NAME)
------------------------------ ---------------------
REC_DATA_01 281
REC_DATA_02 48
REC_DATA_03 70
REC_DATA_04 26
REC_DATA_05 44
REC_DATA_06 51
REC_DATA_07 13
REC_DATA_08 48
REC_DATA_09 32
REC_DATA_10 52
REC_DATA_11 35
REC_DATA_12 30
Additional query:
SQL> select * from dba_segments where segment_name='PPREC' and partition_name='PART20120912';
OWNER SEGMENT_NAME PARTITION_NAME SEGMENT_TYPE TABLESPACE_NAME HEADER_FILE HEADER_BLOCK BYTES BLOCKS EXTENTS
----- ------------ -------------- --------------- --------------- ----------- ------------ ----- ------ -------
HIST PPREC PART20120912 TABLE PARTITION REC_DATA_01 13 475315 65536 8 1
INITIAL_EXTENT NEXT_EXTENT MIN_EXTENTS MAX_EXTENTS PCT_INCREASE FREELISTS FREELIST_GROUPS RELATIVE_FNO BUFFER_POOL
-------------- ----------- ----------- ----------- ------------ --------- --------------- ------------ -----------
65536 1 2147483645 13 DEFAULT
Tablespace usage
Here is a space summary (a composite of dba_tablespaces, dba_data_files, dba_segments, and dba_free_space):
TABLESPACE_NAME TOTAL_MEGABYTES USED_MEGABYTES FREE_MEGABYTES
------------------------------ --------------- -------------- --------------
REC_01_INDX 30,700 250 30,449
REC_02_INDX 7,745 7 7,737
REC_03_INDX 22,692 15 22,677
REC_04_INDX 15,768 10 15,758
REC_05_INDX 25,884 16 25,868
REC_06_INDX 27,992 16 27,975
REC_07_INDX 17,600 10 17,590
REC_08_INDX 18,864 11 18,853
REC_09_INDX 19,700 12 19,687
REC_10_INDX 28,716 16 28,699
REC_DATA_01 102,718 561 102,156
REC_DATA_02 24,544 3,140 21,403
REC_DATA_03 72,710 4 72,704
REC_DATA_04 29,191 2 29,188
REC_DATA_05 42,696 3 42,692
REC_DATA_06 52,780 323 52,456
REC_DATA_07 16,536 1 16,534
REC_DATA_08 49,247 3 49,243
REC_DATA_09 30,848 2 30,845
REC_DATA_10 49,620 3 49,616
REC_DATA_11 40,616 2 40,613
REC_DATA_12 184,922 123,435 61,486
The tablespace usage seems to confirm that this table is not empty, in fact its last tablespace (REC_DATA_12) seems pretty busy.
Existing scripts
What I find puzzling is that there are some PL/SQL stored procedures that seem to work on that table and get data out of it.
An example of such a stored procedure is as follows:
procedure FIRST_REC as
vpartition varchar2(12);
begin
select 'PART'||To_char(sysdate,'YYYYMMDD') INTO vpartition FROM DUAL;
execute immediate
'MERGE INTO FIRST_REC_temp a
USING (SELECT bno, min(trdate) mintr,max(trdate) maxtr
FROM PPREC PARTITION ('||vpartition||') WHERE route_id IS NOT NULL AND trunc(trdate) <= trunc(sysdate-1)
GROUP BY bno) b
ON (a.bno=b.bno)
when matched then
update set a.last_tr = b.maxtr
when not matched then
insert (a.bno,a.last_tr,a.first_tr)
values (b.bno,b.maxtr,b.mintr)';
commit;
end FIRST_REC;
However, if I try using the same syntax manually on the table, here is what I get:
SQL> select count(*) from PPREC PARTITION (PART20120912);
COUNT(*)
----------
0
I have tried a few random partitions and I always get the same 0 count.
Summary
- I see a table that seems to contain data (space used, tablespaces, data files)
- The table is partitioned (one partition per day over a period of 730 days, ending at the end of January 2013)
- Scripts are extracting data from that table somehow
Question
- My queries using PARTITION all return "no rows selected". What am I doing wrong? How can I find out how to extract data from this table?
I suppose it's possible that some other process is deleting the data, but without visiting your site there's no way for anyone here to tell whether that is so.
I don't see the name of the partitioning DATE column mentioned in your post, but based on the SQL you posted I'll assume it's TRDATE; if that is not correct, change TRDATE in the statement below to the actual partitioning column.
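(If you want to confirm the partitioning column yourself, and you have access to the same DBA views you have been querying, something like this should tell you; the PPREC filter follows your earlier queries:)
SELECT column_name, column_position
FROM dba_part_key_columns
WHERE name = 'PPREC';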
That said, give this a try:
SELECT COUNT(*)
FROM PPREC
WHERE TRDATE >= TO_DATE('01-SEP-2012 00:00:00', 'DD-MON-YYYY HH24:MI:SS')
This assumes you should have data in this table from September. If you find data, great. If you don't - well, Back In The Day (when men were men, women were women, and computers were water-cooled :-) we had a little saying about memory on IBM mainframes:
1. If you can see it, and it's there, it's Real.
2. If you can't see it, but it's there, it's Protected.
3. If you can see it, but it's not there, it's Virtual.
4. If you can't see it, and it's not there, it's GONE!
:-)
Use of the PARTITION clause should be reserved for situations where you are experiencing a performance problem and the usual fixes (adding indexes, deleting unnecessary data, human sacrifice, etc.) haven't worked. (Note: guessing about what is or is not going to be a performance problem is not allowed; until you've got a performance problem, you don't have a performance problem. Over the years I've found that software spends a lot of execution time in the darndest places :-)
Basically, write your queries normally and trust the database to get it right. In the general case, always write the simplest code, and do the simplest thing, that could possibly work; 99+ percent of the time it will be fine. That lets you spend your optimization time on the less-than-one-percent of cases where simple isn't good enough, and most of the software you write or design will be simple and easy to understand.
Share and enjoy.