How to copy data to another table without overwriting existing columns - hadoop

I have 2 tables in Amazon DynamoDB: Elements and Containers. The hierarchy is that one container can hold several elements.
An element looks like: uuid, timestamp, container_id, data.
I want to aggregate the data from all elements into the corresponding container, for example:
Elements:
| uuid | container_id | data |
| 1 | 1 | 100 |
| 2 | 1 | 150 |
| 3 | 2 | 100 |
So I want to end up with this in the Containers table:
| uuid | data |
| 1 | 250 |
| 2 | 100 |
So, using Hive, I wrote a script (it runs on an EMR cluster):
CREATE EXTERNAL TABLE element (`uuid` string, `container_id` bigint, `data` double) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES("dynamodb.table.name"="Elements", "dynamodb.column.mapping"="uuid:UUID,container_id:container_id,data:data");
CREATE EXTERNAL TABLE container (`uuid` string, `data` double) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES("dynamodb.table.name"="Containers", "dynamodb.column.mapping"="uuid:UUID,data:data");
INSERT INTO TABLE container SELECT container_id as `uuid`, sum(`data`) as `data` FROM element WHERE container_id IS NOT NULL GROUP BY container_id;
This works well, but now I need to write some additional data to the Containers table, so an item should look like uuid, data, another_data. When I run the script above, it overwrites another_data (and any other attribute that is not listed in the external table). I have tried many variants but can't find a solution.

Ok, I've found an answer:
CREATE EXTERNAL TABLE element (`uuid` string, `container_id` bigint, `data` double) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES("dynamodb.table.name"="Elements", "dynamodb.column.mapping"="uuid:UUID,container_id:container_id,data:data");
CREATE EXTERNAL TABLE container (`uuid` string, `data` double, `another_data` double) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES("dynamodb.table.name"="Containers", "dynamodb.column.mapping"="uuid:UUID,data:data,another_data:another_data");
INSERT INTO TABLE container SELECT element.`container_id` as `uuid`, sum(element.`data`) as `data`, collect_set(container.`another_data`)[0] as `another_data` FROM element LEFT JOIN container ON (element.`container_id` = container.`uuid`) WHERE element.container_id IS NOT NULL GROUP BY element.container_id;

Related

Local index in Range interval partitioned table is not being used

I have a range-interval partitioned table. It holds 6 trillion rows for 1 year of data.
CREATE TABLE eip.Meter_Read_Alert_test
(
Mfg_serial_num VARCHAR2(50 BYTE) ,
Channel_id NUMBER NOT NULL,
Read_time TIMESTAMP(0),
CONSTRAINT pk_Alert_test PRIMARY KEY (ID,channel_id,Read_time)
)
PARTITION BY RANGE (Read_time) INTERVAL(NUMTOYMINTERVAL(1, 'MONTH'))
(
PARTITION p1 VALUES less than('01-09-19 12:00:00.000000000 AM')
) ;
Local indexes were created on the columns below:
CREATE INDEX mfg_SNo_test_idx on eip.Meter_Read_Alert_test ( Mfg_serial_num ) tablespace SPRING_METER_READ Local ;
CREATE INDEX channel_ID_test_idx on eip.Meter_Read_Alert_test (Channel_ID) tablespace SPRING_METER_READ Local ;
CREATE INDEX ReadTime_test_idx on eip.Meter_Read_Alert_test (Read_Time) tablespace SPRING_METER_READ Local ;
Issue:
When I run the query below, the ReadTime_test_idx index is not used; a full table scan happens instead.
select * from meter_read_alert_test
where read_time between '19-11-2019 12:00:00 AM' and '19-11-2019 11:00:00 PM';
Plan hash value: 2722527583
-----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9090K| 728M| 411K (6)| 00:00:17 | | |
| 1 | PARTITION RANGE ITERATOR| | 9090K| 728M| 411K (6)| 00:00:17 | KEY | KEY |
|* 2 | TABLE ACCESS FULL | METER_READ_ALERT_TEST | 9090K| 728M| 411K (6)| 00:00:17 | KEY | KEY |
-----------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("READ_TIME">=TO_TIMESTAMP('19-11-2019 12:00:00 AM') AND
"READ_TIME"<=TO_TIMESTAMP('19-11-2019 11:00:00 PM'))
Please suggest what is wrong here and how to fix it.
Oracle is not necessarily doing anything wrong here by using a full table scan instead of an index range scan. Full table scans are best when reading a large percentage of the data: they can use multi-block reads and don't have to traverse a tree structure for every value. And if the table rows are not stored in index order, an index read may have to retrieve most of the table's blocks anyway.
While your query is only reading a small fraction of the overall data, it is reading a "large" percentage of data from the partition. Since the table is month partitioned, Oracle is using partition pruning to instantly eliminate the majority of data (you can see that in action with the "Key" start and stop partitions). Inside that partition, the query is reading about one day's worth of data, which is about 3% of the data in a partition. There's not a generic number that represents what a "large" percentage is, but there are many cases where 3% is better read with a full table scan than an index.
There's a possibility that Oracle is making a wrong guess here. You might want to try the query with an index hint like select /*+ index(meter_read_alert_test) */ .... If that improves the performance, first try regathering stats. You shouldn't normally need to use index hints.
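A minimal sketch of the two checks suggested above, using the table and index names from the question. The explicit TO_TIMESTAMP format masks and the 'EIP' owner passed to DBMS_STATS are assumptions; adjust them to your session settings and schema.

```sql
-- 1. Force the local index on Read_time and compare timing with the unhinted query.
--    The hint references the alias (t) because the table is aliased in FROM.
SELECT /*+ INDEX(t ReadTime_test_idx) */ *
FROM   eip.meter_read_alert_test t
WHERE  t.read_time BETWEEN TO_TIMESTAMP('19-11-2019 12:00:00 AM', 'DD-MM-YYYY HH:MI:SS AM')
                       AND TO_TIMESTAMP('19-11-2019 11:00:00 PM', 'DD-MM-YYYY HH:MI:SS AM');

-- 2. If the hinted query is faster, regather statistics for the table and its
--    indexes (cascade => TRUE), then re-run the unhinted query.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'EIP',
    tabname => 'METER_READ_ALERT_TEST',
    cascade => TRUE);
END;
/
```

If the hinted version is not faster, the full scan of the pruned partition was probably the right choice and the hint should be dropped.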

Force use of primary key in Oracle during search

I have a scenario where I need to search and display records from huge tables with lots of rows. I have predefined search criteria for my tables, for which the user can provide filter values and click search.
Consider a sample table:
CREATE TABLE suppliers
( supplier_name varchar2(50) NOT NULL,
address varchar2(50),
city varchar2(50) NOT NULL,
state varchar2(25),
zip_code varchar2(10),
CONSTRAINT "suppliers_pk" PRIMARY KEY (supplier_name, city)
);
INSERT INTO suppliers VALUES ('ABCD','XXXX','YYYY','ZZZZ','95012');
INSERT INTO suppliers VALUES ('EFGH','MMMM','NNNN','OOOO','95010');
INSERT INTO suppliers VALUES ('IJKL','EEEE','FFFF','GGGG','95009');
I have provided the user with search fields matching the primary key: supplier_name and city.
If the user fills in both fields, query performance is good, since it uses an index scan:
SELECT supplier_name, address, city, state, zip_code FROM suppliers where supplier_name = 'ABCD' and city = 'ZZZZ';
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 102 | 1 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| SUPPLIERS | 1 | 102 | 1 (0)| 00:00:01 |
|* 2 | INDEX UNIQUE SCAN | suppliers_pk | 1 | | 1 (0)| 00:00:01 |
However, if the user fills in only one of the search fields, query performance is bad, since it uses a full table scan:
SELECT supplier_name, address, city, state, zip_code FROM suppliers where supplier_name = 'ABCD' ;
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 102 | 3 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| SUPPLIERS | 1 | 102 | 3 (0)| 00:00:01 |
Is there a way to force Oracle to treat this as a primary key search when I don't have all of the key fields in the search? Something like the query below (which obviously does not work):
SELECT supplier_name, address, city, state, zip_code FROM suppliers where supplier_name = 'ABCD' and city = city;
Thanks.
You are thinking about this in the wrong way.
The query optimiser will choose what it thinks is the best execution plan for the query, based on the information available at the time the query is parsed (or sometimes when the parameters change). Generally, if you give it the right information in terms of stats etc., it will do a good job.
You might think that you know better than it does, but remember that you won't be monitoring this for the life of the database. The data changes, and you want the database to be able to react and change the execution plan when it needs to.
That said, if you are set on forcing it to use the index, you can use a hint:
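-- The constraint was created as "suppliers_pk" (quoted, lower case), so the hint quotes the name too.
SELECT /*+ INDEX(suppliers "suppliers_pk") */ supplier_name, address, city, state, zip_code
FROM suppliers
WHERE supplier_name = 'ABCD';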
A full table scan is not necessarily bad. You have only a few rows in your table, so the optimizer thinks it is better to do a full table scan (FTS) than an index range scan. It will start using the PK index as soon as the RDBMS thinks it is better, i.e. when you have lots of rows and the restriction on a certain supplier reduces the result significantly. If you want to search on city only instead of supplier, you need another index on city only (or at least starting with city). Keep in mind that you might have to update the table statistics after you have loaded your table with bulk data. It is always important to test query performance with reasonably realistic amounts of data.
The index is organised first on supplier_name and second on city, so a query that filters on city alone cannot use it for a range scan.
Please create a second index on city only. This will help your query.
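A short sketch of that second index, followed by a plan check. The index name suppliers_city_idx is hypothetical, and the literal 'YYYY' is just one of the sample city values from the question.

```sql
-- Hypothetical index name; supports searches that filter on city alone.
CREATE INDEX suppliers_city_idx ON suppliers (city);

-- Check which access path the optimizer picks for a city-only search.
EXPLAIN PLAN FOR
SELECT supplier_name, address, city, state, zip_code
FROM   suppliers
WHERE  city = 'YYYY';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

With only three rows in the table, the plan may still show a full table scan; the index only starts to pay off with realistic data volumes.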

Avoid full table scan on a table Oracle

I have a table with 70 columns, where the primary key is a combination of 15 columns (a mix of NUMBER and VARCHAR2). Please see the query below:
select * from tab1 where k1=1234567889;
Plan hash value: 1179808636
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6044 | 2201K| 4585K (1)| 15:17:04 |
|* 1 | TABLE ACCESS FULL| tab1 | 6044 | 2201K| 4585K (1)| 15:17:04 |
---------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("K1"=30064825087)
Here tab1 is the table mentioned above and k1 is a column that is part of the primary key. The table is not partitioned. It was also analyzed (table, indexes and columns) after the data was inserted. The query above returns more than 100,000 records. The problem is that even with k1 in the PK, the query does a full table scan, which is not acceptable. On the other hand, using index hints does not really speed it up.
Please advise what would be the possible solution.
For this query:
select *
from tab1
where k1 = 1234567889;
The best index is one that has k1 as the first key in the index. It can be a composite index, but k1 has to be the first key. It sounds like you have a composite primary key and k1 is not the first key.
I would recommend that you simply define another index:
create index idx_tab1_k1 on tab1(k1);
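To confirm whether k1 really is the leading column of an existing index, the data dictionary can be queried. This is a sketch that assumes the table belongs to the current user and was created with an unquoted (upper-case) name.

```sql
-- List the column order of every index on the table.
-- COLUMN_POSITION = 1 marks the leading column of each index.
SELECT index_name, column_name, column_position
FROM   user_ind_columns
WHERE  table_name = 'TAB1'   -- assumption: unquoted, upper-case table name
ORDER  BY index_name, column_position;
```

If no index lists K1 at position 1, a predicate on k1 alone cannot drive an ordinary index range scan.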
There are several ways to avoid a full table scan:
Indexes: Ensure that an index exists on the key value and that it has been analyzed with dbms_stats.
USE_NL hint: You can direct the optimizer to use a nested loops join (which is only efficient when the inner table is indexed).
INDEX hint: You can specify the indexes that you want to use.
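A sketch of the dbms_stats step for a single index, with a dictionary query to see what the optimizer knows about it. It assumes the index idx_tab1_k1 from the other answer exists in the current schema.

```sql
-- Gather statistics for one index.
BEGIN
  DBMS_STATS.GATHER_INDEX_STATS(ownname => USER, indname => 'IDX_TAB1_K1');
END;
/

-- Inspect the figures the optimizer uses when costing index access.
SELECT index_name, last_analyzed, num_rows, distinct_keys, clustering_factor
FROM   user_indexes
WHERE  table_name = 'TAB1';
```

A clustering factor close to the table's row count suggests that index access will touch many different blocks, which is one reason a hint may not make the query faster.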

Oracle partition key

I have many tables with large amounts of data. The PK is the column TAB_ID, which has data type RAW(16). I created hash partitions with TAB_ID as the partition key.
My issue is that the SQL statement (select * from my_table where tab_id = 'aas1df') does not use partition pruning. If I change the column data type to varchar2(32), partition pruning works.
Why doesn't partition pruning work when the partition key has data type RAW(16)?
I'm just guessing: try select * from my_table where 'aas1df' = tab_id.
The data type conversion probably works the other way round from what you expect. In any case, you should use the function UTL_RAW.CAST_TO_RAW.
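Note that UTL_RAW.CAST_TO_RAW and HEXTORAW (used further down in this thread) are not interchangeable. A small illustration, assuming an ASCII-compatible database character set:

```sql
-- HEXTORAW reads the string as hex digits: two characters become one byte.
SELECT HEXTORAW('1B') AS from_hex FROM dual;                -- RAW value 1B (1 byte)

-- UTL_RAW.CAST_TO_RAW keeps the character bytes themselves ('1' = 0x31, 'B' = 0x42).
SELECT UTL_RAW.CAST_TO_RAW('1B') AS from_chars FROM dual;   -- RAW value 3142 (2 bytes)
```

Which one is right depends on how the TAB_ID values were written. For a RAW(16) column holding a 32-hex-digit identifier, HEXTORAW is the conversion that yields 16 bytes.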
Edited:
Is your table partitioned by TAB_ID? If so, there is something wrong with your design: you usually partition a table by some useful business value, NOT by a surrogate key.
If you know the PK value, you do not need partition pruning at all. When Oracle traverses the PK index it gets a ROWID value. This ROWID contains the file number, the block ID and the row number within the block, so Oracle can access the row directly.
HEXTORAW enables partition pruning.
In the sample below the Pstart and Pstop are literal numbers, implying partition pruning occurs.
create table my_table
(
TAB_ID raw(16),
a number,
constraint my_table_pk primary key (tab_id)
)
partition by hash(tab_id) partitions 16;
explain plan for
select *
from my_table
where tab_id = hextoraw('1BCDB0E06E7C498CBE42B72A1758B432');
select * from table(dbms_xplan.display(format => 'basic +partition'));
Plan hash value: 1204448714
--------------------------------------------------------------------------
| Id | Operation | Name | Pstart| Pstop |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | |
| 1 | TABLE ACCESS BY GLOBAL INDEX ROWID| MY_TABLE | 2 | 2 |
| 2 | INDEX UNIQUE SCAN | MY_TABLE_PK | | |
--------------------------------------------------------------------------

How should I design my tables for concurrent table scan access?

I need to hold multiple pairs of 70,000-row sets and perform a difference-like comparison between the two sets of each pair using the MINUS operator. Comparisons (table scans) could be running at any time.
I currently have one table with this sort of design:
primary key (sequenced)
foreign key to identify set
key to identify set #1 or set #2
then the data I need to run the MINUS against
The data would look something like this
| PK | FK | Key | Data |
| 1 | 1 | Left | Some data |
| 1 | 1 | Left | Diff data |
| 1 | 1 | Right | Some data |
My query would be:
SELECT data
FROM diffTable
WHERE pk = 1
AND fk = 1
AND key = 'Left'
MINUS
SELECT data
FROM diffTable
WHERE pk = 1
AND fk = 1
AND key = 'Right'
I fear the full table scans will monopolise resources, and that subsequent inserts and MINUS queries will suffer.
How should I design my tables and why?
create index PK_FK on diffTable
(PK, FK, Key);
The query you posted in your question would run very fast with this index.
Btw, the PK column is not, by itself, the primary key. See the other comments. Perhaps you want:
alter table diffTable
add constraint PK_FK primary key (PK, FK, Key);
Maybe pick a better name...
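A possible variation, not part of the answer above: if the data column is short enough to be indexed (an assumption, since the question does not give its type), appending it to the index lets both branches of the MINUS be answered from the index alone, without visiting table blocks. The index name is hypothetical.

```sql
-- Hypothetical index name. The leading columns match the WHERE clause;
-- data is appended last so the query never has to read the table itself.
CREATE INDEX diff_lookup_idx ON diffTable (pk, fk, key, data);
```

The trade-off is a larger index and slightly slower inserts, so it is worth measuring against the three-column index before adopting it.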
