Choice of a data type for Parquet join columns (i.e., keys) - hadoop

With an RDBMS we usually use numeric columns for keys (both foreign and primary), since in most cases they give better join performance and lower resource usage than other data types (like strings).
The question is: what should be the data type of choice for the key columns in Parquet tables? Can we go like this:
SELECT * FROM parquet_table1 JOIN parquet_table2 ON t1_string_pk = t2_string_fk
What is the best practice here?
The reason for this question is that when loading data into a data warehouse, a numeric key column (for a target table) requires a key-table lookup ([source system, source key] -> surrogate key), whereas a string key column does not: we can concatenate the source keys to produce a string surrogate key value.
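For illustration, the two approaches might look like this in SQL; the staging_orders and key_map tables and their columns are made-up names, not from any real schema:
-- Numeric surrogate key: the load needs a lookup against a key map.
SELECT k.surrogate_key, s.*
FROM staging_orders s
JOIN key_map k
  ON k.source_system = s.source_system
 AND k.source_key = s.source_key;

-- String surrogate key: derived directly by concatenation, no lookup.
SELECT CONCAT(s.source_system, '|', s.source_key) AS surrogate_key, s.*
FROM staging_orders s;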

Related

How to populate fact table with Surrogate keys from dimensions?

Could someone please help me understand how to populate the fact table with surrogate keys from dimensions using SSIS?
I load my dimension tables and assign each a surrogate key. I want to add these surrogate keys to my fact table, but I don't know where to start.
You just join your fact source records to the relevant dimension tables and get the surrogate keys, which you then insert into your fact table.
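In plain SQL the lookup described above looks roughly like this (in SSIS you would typically use Lookup transformations instead); all table and column names here are illustrative:
INSERT INTO fact_sales (date_sk, product_sk, customer_sk, amount)
SELECT d.date_sk, p.product_sk, c.customer_sk, s.amount
FROM staging_sales s
JOIN dim_date d ON d.date_value = s.sale_date
JOIN dim_product p ON p.product_code = s.product_code
JOIN dim_customer c ON c.customer_number = s.customer_number;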

Replace foreign key column with compressed index

I would like to eliminate some tables in my database.
One table, for example, has a simple primary-key ID column and a VARCHAR2 column.
The VARCHAR2 column has NO duplicate values, yet each value has a different unique ID.
The PK column of this table is referenced just once as a foreign key in another table.
My thought is now to move the values from the VARCHAR2 column into the table that held the foreign key reference.
I could then remove the foreign key reference, delete the table, and gain a new column with all the (duplicate) VARCHAR2 values. These I would like to compress in a unique/distinct way.
I have heard about indexes in the Oracle Database that compress column(s), but I am not quite sure which index I need or how to use it...
The underlying behaviour (and storage savings) should be about the same as with the previous table of unique values and the foreign key reference.
Thank you in advance for your help!
Oracle basic compression allows us to compress tables. It comes with several distinct limitations, not the least of which is that it isn't suitable for OLTP databases: only direct-path inserts benefit, while conventional inserts, updates and deletes don't. So you can't do what you want that way. If your organisation has sprung for the Advanced Compression licence then you have more options, but the compression still works on the table, not an individual column.
I think you've confused things with index compression, which does operate on columns: it allows us to compress the leading column(s) of a compound index. But it's worth applying only when there's a lot of repetition in those columns. If your index has a unique ID as the leading column, then compression will actually increase the total amount of space taken. (Just one reason why compound indexes should be built with the least selective column first and the most selective column last.)
Your table is a classic key-value lookup table, so you could consider converting it into an index-organized table. You would save yourself a bit of space by maintaining only a specialized index instead of a table and its primary key index.
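As a rough sketch (table and column names are made up), the conversion could look like this:
-- Index-organized table: the rows live inside the primary-key B-tree,
-- so there is no separate heap segment and no standalone PK index.
CREATE TABLE lookup_text (
    id NUMBER PRIMARY KEY,
    text_val VARCHAR2(200) NOT NULL
) ORGANIZATION INDEX;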

Is a primary key necessary for greenplum database?

We know Greenplum is an MPP data warehouse. We will import data from MySQL into it every day, and primary keys may conflict across different sources. I am designing the schema, and I am not sure:
Is a primary key required for each table?
From the official docs, the primary key is used for partitioning by default, but I can specify another key to partition on. Is there any other reason that I have to set a primary key?
No, a primary key is not needed in Greenplum. It will actually slow down your loading performance, take up storage space, and likely not be used for any queries.
The distribution key is oftentimes set to the logical primary key of a table, but without an actual primary key constraint being created. The distribution key should be a high-cardinality column, like the primary key, which helps distribute the data evenly across the segments.
And yes, you can specify another column as the distribution key.
Lastly, I wouldn't call this a way to "partition" the data, because partitioning is something else in Greenplum. Partitioning is akin to Oracle or SQL Server partitioning, with the query optimizer eliminating partitions based on conditions in the query (e.g. WHERE month = 1).
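To make the distinction concrete, here is a minimal Greenplum DDL sketch (the sales table and its columns are made up):
CREATE TABLE sales (
    sale_id BIGINT,
    month INT,
    amount NUMERIC
)
DISTRIBUTED BY (sale_id)    -- distribution: spreads rows evenly across segments
PARTITION BY RANGE (month)  -- partitioning: enables elimination (WHERE month = 1)
(START (1) END (13) EVERY (1));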

Will Oracle optimize if using it like a key-value store?

I am using Oracle 11.2. Our use case is NoSQL-like, but we still need to persist data in Oracle. The requirement is to store three columns: key, name, value.
There can be millions of keys and hundreds of names, and values are about 2k~8k in size. Different names can have the same key value.
Suppose I just create a table in Oracle with these three columns and use key+name as the primary key.
Is it doable to let Oracle store all rows with the same key together in the same key space?
As rows with the same key accumulate, will Oracle keep their values together in physical storage?
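One commonly cited way to get this kind of physical clustering in Oracle is an index-organized table; the following is only a sketch under that assumption, with illustrative names:
-- An index-organized table stores rows in primary-key order, so rows
-- sharing the same KEY are physically clustered in the B-tree.
-- OVERFLOW lets the large VALUEs spill out of the index leaf blocks.
CREATE TABLE kv_store (
    key VARCHAR2(255),
    name VARCHAR2(255),
    value CLOB,
    CONSTRAINT kv_store_pk PRIMARY KEY (key, name)
) ORGANIZATION INDEX
OVERFLOW;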

Oracle Table structure

I have a table in an Oracle database which has 60 columns. The following is the table structure:
ID NAME TIMESTAMP PROPERTY1 ...... PROPERTY60
This table will have many rows, and its size will be in GBs. But the problem with this structure is that if I have to add a new property in the future, I have to change the schema. To avoid that, I want to change the table structure to the following:
ID NAME TIMESTAMP PROPERTYNAME PROPERTYVALUE
A sample row would be:
1 xyz 40560 PROPERTY1 34500
This way I will solve the issue, but the table will grow much bigger. Will it have any impact on performance in terms of fetching data? I am new to Oracle. I need your suggestion on this.
if I have to add a new property, I have to change the schema
Is that actually a problem? Adding a column has gotten cheaper and more convenient in newer versions of Oracle.
But if you still need to make your system dynamic, in the sense that you don't have to execute DDL for new properties, the following simple EAV implementation would probably be a good start:
CREATE TABLE FOO (
    FOO_ID INT PRIMARY KEY
    -- Other fields...
);

CREATE TABLE FOO_PROPERTY (
    FOO_ID INT REFERENCES FOO (FOO_ID),
    NAME VARCHAR(50),
    VALUE VARCHAR(50) NOT NULL,
    CONSTRAINT FOO_PROPERTY_PK PRIMARY KEY (FOO_ID, NAME)
) ORGANIZATION INDEX;
Note ORGANIZATION INDEX: the whole table is just one big B-tree; there is no table heap at all. Properties that belong to the same FOO_ID are stored physically close together, so retrieving all properties for a known FOO_ID will be cheap (though not as cheap as when all the properties were in the same row).
You might also want to consider whether it would be appropriate to:
Add more indexes on FOO_PROPERTY (e.g. for searching on property name or value). Just beware of the extra cost of secondary indexes on index-organized tables.
Switch the order of columns in the FOO_PROPERTY PK, if you predominantly search on property names and rarely retrieve all the properties of a given FOO_ID. This would also make index compression feasible, since the leading edge of the index would now be a relatively wide string (as opposed to a narrow integer).
Use a different type for VALUE (e.g. RAW, or even an in-line BLOB/CLOB, which can have performance implications but might also provide additional flexibility). Alternatively, you might even have a separate table for each possible value type, instead of stuffing everything into a string.
Separate the property "declaration" into its own table. This table would have two keys: besides the string NAME, it would also have an integer PROPERTY_ID, which could then be used as the FK in FOO_PROPERTY instead of NAME (saving some storage at the price of more JOIN-ing); see the sketch below.
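A sketch of that last variant (the PROPERTY_DEF table name is made up):
CREATE TABLE PROPERTY_DEF (
    PROPERTY_ID INT PRIMARY KEY,
    NAME VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE FOO_PROPERTY (
    FOO_ID INT REFERENCES FOO (FOO_ID),
    PROPERTY_ID INT REFERENCES PROPERTY_DEF (PROPERTY_ID),
    VALUE VARCHAR(50) NOT NULL,
    CONSTRAINT FOO_PROPERTY_PK PRIMARY KEY (FOO_ID, PROPERTY_ID)
) ORGANIZATION INDEX;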
