ORGANIZE / DISTRIBUTE BY with CREATE TABLE in MPP systems - parallel-processing

How is a CREATE TABLE statement's performance influenced by the ORGANIZE BY / DISTRIBUTE BY clause in MPP systems (Netezza / Teradata / Synapse)?
Also, what key should be picked for distribution in such MPP systems?

The short answer is that as long as you choose a ‘distribution’ that gives a very even spread of rows across nodes, it has little to no effect on insert performance.
On Netezza the ‘organize’ is not enforced when you create the table, nor while you insert/update data. The GROOM operation does that later, on your request. Side note: when doing a CTAS or a large insert, add an ORDER BY to the insert statement if possible.
About choosing ‘distribution’ columns:
always ensure a very even spread (if not possible: RANDOM is better)
and never use more than one column
and choose one you plan to do a lot of ‘equal join’ on (a LOT)
About choosing ‘organization’ columns:
only consider columns you plan to do a lot of ‘simple where’ clauses against (=,<,>, LIKE)
tend towards those with few distinct values
and ‘time’ columns are always a good guess
Hint: sometimes you get a ‘free’ organize effect on a column because it is closely related to another column that you already organize on.
Example: If the create_date on average is less than 30 days away from the end_date in a table with 5 years of data, then you will have that effect.
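Putting the advice above together, a minimal Netezza-style sketch (table and column names are invented for illustration; as noted above, the ORGANIZE ON clause only takes effect when the table is groomed later):
CREATE TABLE sales_fact
(
    customer_id  INTEGER,
    store_id     INTEGER,
    create_date  DATE,
    end_date     DATE,
    amount       NUMERIC(18,2)
)
DISTRIBUTE ON (customer_id)   -- one column, evenly spread, heavily used in equi-joins
ORGANIZE ON (create_date);    -- a 'time' column hit by many simple WHERE clauses

-- applied later, on request:
-- GROOM TABLE sales_fact RECORDS ALL;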

Related

Data Types and Indexes

Is there some sort of performance difference for inserting, updating, or deleting data when you use the TEXT data type?
I went here and found this:
Tip: There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
This makes me believe there should not be a performance difference, but my friend, who is much more experienced than I am, says inserts, updates, and deletes are slower for the TEXT data type.
I had a table that was partitioned with a trigger and function, and extremely heavily indexed, but the inserts did not go that slowly.
Now I have another table with 5 more columns, all of which are the text data type, the exact same trigger and function, and no indexes, but the inserts are terribly slow.
From my experience, I think he is correct, but what do you guys think?
Edit #1:
I am uploading the same exact data, just the 2nd version has 5 more columns.
Edit #2:
By "Slow" I mean with the first scenario, I was able to insert 500 or more rows per second, but now I can only insert 20 rows per second.
Edit #3: I didn't add the indexes that exist in the 1st scenario to the 2nd scenario, because indexes are supposed to slow down inserts, updates, and deletes, from my understanding.
Edit #4: I guarantee it is exactly the same data, because I'm the one uploading it. The only difference is, the 2nd scenario has 5 additional columns, all text data type.
Edit #5: Even when I removed all of the indexes on scenario 2 and left all of them on scenario 1, the inserts were still slower on scenario 2.
Edit #6: Both scenarios have the same exact trigger and function.
Edit #7:
I am using an ETL tool, Pentaho, to insert the data, so there is no way for me to show you the code being used to insert the data.
I think I might have had too many transformation steps in the ETL tool. When I tried to insert data in the same transformation as the steps that actually transform the data, it was massively slow. But when I simply loaded the already-transformed data into an empty table and then inserted the data from this table into the actual table I'm using, the inserts were much faster than scenario 1, at 4000 rows per second.
The only difference between scenario 1 and scenario 2, other than the increase in columns in scenario 2, is the number of steps in the ETL transformation. Scenario 2 has about 20 or more steps in the ETL transformation; in some cases, 50 or more.
I think I can solve my problem by reducing the number of transformation steps, or putting the transformed data into an empty table and then inserting the data from this table into the actual table I'm using.
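A minimal sketch of that staging approach, with invented table names:
-- the staging table has already been loaded with fully transformed rows;
-- pushing them into the real table in one set-based statement keeps the
-- per-row overhead of the ETL transformation out of the insert path
INSERT INTO target_table
SELECT * FROM staging_table;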
PostgreSQL text and character varying are the same, with the exception of the (optional) length limit for the latter. They will perform identically.
The only reasons to prefer character varying are
you want to impose a length limit
you want to conform with the SQL standard
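A quick sketch of the point above: the only practical difference is the length check on character varying (table name is invented):
CREATE TABLE demo (
    t text,
    v varchar(10)
);

INSERT INTO demo VALUES (repeat('x', 100), 'short');  -- succeeds; text has no length limit
INSERT INTO demo VALUES ('short', repeat('x', 100));  -- fails: value too long for type character varying(10)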

Vertica query optimization

I want to optimize a query in a Vertica database. I have a table like this:
CREATE TABLE data (a INT, b INT, c INT);
and a lot of rows in it (billions).
I fetch some data using this query:
SELECT b, c FROM data WHERE a = 1 AND b IN ( 1,2,3, ...)
but it runs slowly. The query plan shows something like this:
[Cost: 3M, Rows: 3B (NO STATISTICS)]
The same is shown when I run EXPLAIN on
SELECT b, c FROM data WHERE a = 1 AND b = 1
It looks like a scan over part of the table. In other databases I could create an index to make such a query really fast, but what can I do in Vertica?
Vertica does not have a concept of indexes. You would want to create a query specific projection using the Database Designer if this is a query that you feel is run frequently enough. Each time you create a projection, the data is physically copied and stored on disk.
I would recommend reviewing projection concepts in the documentation.
If you see a NO STATISTICS message in the plan, you can run ANALYZE_STATISTICS on the object.
For further optimization, you might want to use a JOIN rather than IN. Consider using partitions if appropriate.
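Hedged sketches of those last two suggestions, assuming the table lives in the public schema and the IN-list values can be loaded into a small helper table:
-- refresh statistics so the plan no longer reports NO STATISTICS
SELECT ANALYZE_STATISTICS('public.data');

-- rewrite the IN-list as a join against a small lookup table
CREATE TABLE b_filter (b INT);
INSERT INTO b_filter VALUES (1);
INSERT INTO b_filter VALUES (2);
INSERT INTO b_filter VALUES (3);

SELECT d.b, d.c
FROM data d
JOIN b_filter f ON f.b = d.b
WHERE d.a = 1;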
Creating good projections is the "secret-sauce" of how to make Vertica perform well. Projection design is a bit of an art-form, but there are 3 fundamental concepts that you need to keep in mind:
1) SEGMENTATION: For every row, this determines which node to store the data on, based on the segmentation key. This is important for two reasons: a) DATA SKEW -- if data is heavily skewed then one node will do too much work, slowing down the entire query. b) LOCAL JOINS - if you frequently join two large fact tables, then you want the data to be segmented the same way so that the joined records exist on the same nodes. This is extremely important.
2) ORDER BY: If you are performing frequent FILTER operations in the where clause, such as in your query WHERE a=1, then consider ordering the data by this key first. Ordering will also improve GROUP BY operations. In your case, you would order the projection by columns a then b. Ordering correctly allows Vertica to perform MERGE joins instead of HASH joins which will use less memory. If you are unsure how to order the columns, then generally aim for low to high cardinality which will also improve your compression ratio significantly.
3) PARTITIONING: By partitioning your data with a column which is frequently used in the queries, such as transaction_date, etc., you allow Vertica to perform partition pruning, which reads much less data. It also helps during insert operations, allowing a load to affect only one small ROS container instead of the entire file.
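A minimal projection sketch applying points 1 and 2 to the table from the question (the segmentation and sort choices here are assumptions based on the one query shown):
CREATE PROJECTION data_p (a, b, c)
AS
SELECT a, b, c
FROM data
ORDER BY a, b                      -- matches WHERE a = 1 AND b IN (...)
SEGMENTED BY HASH(a, b) ALL NODES; -- hash on both columns for an even spread

-- populate the new projection from existing data
SELECT REFRESH('data');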

Partitioning or bucketing hive table based on only month/year to optimize queries

I'm building a table that contains about 400k rows of a messaging app's data.
The current table's columns look something like this:
message_id (int)| sender_userid (int)| other_col (string)| other_col2 (int)| create_dt (timestamp)
A lot of queries I would be running in the future will rely on a where clause involving the create_dt column. Since I expect this table to grow, I would like to try and optimize it right now. I'm aware that partitioning is one way, but when I partition it based on create_dt the result is too many partitions since I have every single date spanning back to Nov 2013.
Is there a way to instead partition by a range of dates? How about a partition for every 3 months, or even every month? If this is possible, could I end up with too many partitions in the future, making it inefficient? What are some other possible partition methods?
I've also read about bucketing, but as far as I'm aware that's only useful if you would be doing joins on a column that the bucket is based on. I would most likely be doing joins only on column sender_userid (int).
Thanks!
I think this might be a case of premature optimization. I'm not sure what your definition of "too many partitions" is, but we have a similar use case. Our tables are partitioned by date and customer column. We have data that spans back to Mar 2013. This created approximately 160k+ partitions. We also use a filter on date and we haven't seen any performance problems with this schema.
On a side note, Hive is getting better at scaling up to 100s of thousands of partitions and tables.
On another side note, I'm curious as to why you're using Hive in the first place for this. 400k rows is a tiny amount of data and is not really suited for Hive.
Check out Hive's built-in UDFs. With the right combination of them you can achieve what you want. Here's an example to partition on every month (it produces a "YEAR-MONTH" string that you can use as the partition column value):
select concat(cast(year(to_date(create_dt)) as string),'-',cast(month(to_date(create_dt)) as string))
But when partitioning on dates it is usually useful to have multiple levels of the date dimension, so in this case you should have two partition columns, the first for year and the second for month:
select year(to_date(create_dt)),month(to_date(create_dt))
Keep in mind that in Hive timestamps and dates are handled as strings, and that functions like month() and year() return integers from those date fields. You can use simple arithmetic on those values to figure out the right partition.
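For example, a hedged sketch of a monthly-partitioned copy of the table, loaded with dynamic partitioning (messages_raw and messages_part are invented names):
-- target table partitioned by year and month
CREATE TABLE messages_part (
    message_id    INT,
    sender_userid INT,
    other_col     STRING,
    other_col2    INT,
    create_dt     TIMESTAMP
)
PARTITIONED BY (yr INT, mth INT);

-- allow dynamic partitioning for the load
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE messages_part PARTITION (yr, mth)
SELECT message_id, sender_userid, other_col, other_col2, create_dt,
       year(to_date(create_dt))  AS yr,
       month(to_date(create_dt)) AS mth
FROM messages_raw;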

Create Oracle partitioning table without PK

We have a huge table with 144 million rows right now, growing by about 1 million rows each day.
I would like to create a partitioned table on an Oracle 11g server but I am not familiar with the techniques. So I have two questions:
Is it possible to create a partitioned table from a table that doesn't have a PK?
What do you suggest for partitioning a table with this many records?
Yes, but keep in mind that the partition key must be part of the PK.
Avoid global indexes
Choose the right partitioning key - have it prepared for some kind of future maintenance (cutting off the oldest or unnecessary partitions, placing them in separate tablespaces, etc.)
There are too many things to consider.
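For the second question, a hedged sketch of 11g range-interval partitioning on a date column (names are invented; the real key should follow the advice above):
CREATE TABLE big_events
(
    event_id    NUMBER,
    created_dt  DATE NOT NULL,
    payload     VARCHAR2(4000)
)
PARTITION BY RANGE (created_dt)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))   -- Oracle adds a new monthly partition automatically
(
    PARTITION p_start VALUES LESS THAN (DATE '2015-01-01')
);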
"There are several non-unique index on the table. But, the performance
is realy terrible! Just simple count function was return result after
5 minutes."
Partitioning is not necessarily a performance enhancer. The partition key will allow certain queries to benefit from partition pruning, i.e. those queries which drive off the partition key in the WHERE clause. Other queries may perform worse if their WHERE clause runs against the grain of the partition key.
It is difficult to give specific advice because the details you've posted are so vague. But here are some other possible ways of speeding up queries on big tables:
index compression
parallel query
better, probably compound, indexes.
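Hedged one-line sketches of those suggestions, reusing the invented names from the partitioning example above:
-- compressed compound index (on the partitioned table above you would typically also declare it LOCAL)
CREATE INDEX big_events_ix ON big_events (created_dt, event_id) COMPRESS 1;

-- parallel query via a hint
SELECT /*+ PARALLEL(e, 8) */ COUNT(*)
FROM big_events e;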

Oracle 'pseduo-fact' view

Assumptions:
I have a number of tables comprised of facts and foreign keys ('dimensional' and 'key-value' type). For example, ENCOUNTER:
ID - primary key
dimensions
LOCATION_ID
PATIENT_ID
key-value
TYPE_ID
STATUS_ID
PATIENT_CLASS_ID
DISPOSITION_ID
...
facts
ADMISSION_DATE
DISCHARGE_DATE
...
I don't have the option to create a data warehouse
I would like to simplify the data structure for reporting
My approach is to create a number of pseudo-dimensional views ('D_LOCATION' based on the DEPARTMENT and LOCATION tables) and pseudo-fact views ('F_ENCOUNTER' based on ENCOUNTER table). In the pseudo-fact view, I would JOIN the key-value tables (e.g. STATUS, PATIENT_CLASS) to the fact table to include the name fields (e.g. STATUS.NAME, PATIENT_CLASS.NAME).
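A hedged sketch of the F_ENCOUNTER view described above (the key-value table names and their NAME columns are assumptions):
CREATE OR REPLACE VIEW f_encounter AS
SELECT e.id,
       e.location_id,
       e.patient_id,
       s.name  AS status_name,
       pc.name AS patient_class_name,
       d.name  AS disposition_name,
       e.admission_date,
       e.discharge_date
FROM encounter e
JOIN status s         ON s.id  = e.status_id
JOIN patient_class pc ON pc.id = e.patient_class_id
JOIN disposition d    ON d.id  = e.disposition_id;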
Questions:
If a query selects a subset of all of the fields from F_ENCOUNTER (i.e. not all of the key-value.name fields), is the Oracle 10g optimizer smart enough to exclude some of the key-value table joins (i.e. the ones that aren't included in the query)?
Is there anything that I can do to optimize this architecture (other than indices)
Is there another approach?
** edit **
Goals (in order of importance):
reduce query complexity; increase query consistency; decrease report-development time
optimize query-processing
minimize administrator burden
decrease storage
One optimization suggestion is not to use key-value pair tables. The concept of a dimension table is that each record should contain all the information about that concept without needing to join out to normalized tables, which is what turns a star schema into a snowflake schema.
While values might be repeated across dimension table records, this has the advantage of fewer joins in your reporting queries. Denormalizing tables in this way might seem counterintuitive, but where performance is paramount it is usually the best solution.
I don't believe Oracle would exclude any joins done in the view, because the joins can impact the number of rows returned. (As when an inner join fails to match any rows, making the whole result set empty.)
What are the goals of your optimization? Query speed? query simplicity? storage efficiency? If you can sacrifice storage efficiency for better query performance, then replace the key-value references with the values themselves (TYPE_NAME instead of TYPE_ID, PATIENT_CLASS_NAME instead of PATIENT_CLASS_ID, etc.).
[Edit:] If the original architecture cannot be modified, consider using a materialized view. It would essentially pre-compute the joins and store the result set, giving you speedy query time at the cost of extra storage space and possibly-not-fresh data. You can control the latter by specifying an appropriate refresh policy. See http://en.wikipedia.org/wiki/Materialized_view and http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/mv.htm for further details.
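A hedged sketch of that materialized-view approach, reusing the assumed names from the view sketch above:
CREATE MATERIALIZED VIEW f_encounter_mv
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT e.id,
       e.patient_id,
       s.name AS status_name,
       e.admission_date,
       e.discharge_date
FROM encounter e
JOIN status s ON s.id = e.status_id;

-- refresh on whatever schedule keeps the data fresh enough, e.g.
-- EXEC DBMS_MVIEW.REFRESH('F_ENCOUNTER_MV', 'C');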

Resources