compress similar rows in oracle/SQL - oracle

I'm trying to consider a table design in oracle/SQL where many rows in a database are related to each other and will always be queried together but do contain different information (although many of the columns will contain similar information between the similar rows)
In that sense, it seems to me that it would be more efficient to somehow compress several rows into a single row in Oracle that contains a single common recordID and is always stored together on disk since they are always inserted, deleted, queried and extracted together. For this type of table, is there some sort of Row Compression that can be used so that these related rows aren't treated as individual rows for better performance?
Updated: An example would be as follows
Field1 Field2 Field3
1 1 A
1 2 B
1 3 C
2 4 D
2 5 E
In this example, I would always insert and query the first three rows together (because they share Field1 values). They are separate pieces of data, but they are never separated from each other. Is there some way to insert, store, index and extract them as a group while keeping them as separate data rows?

If you are on Exadata then you can go for Exadata Hybrid Columnar Compression.
http://www.oracle.com/technetwork/database/exadata/ehcc-twp-131254.pdf
If you are not on Exadata you can still use OLTP compression.
http://allthingsoracle.com/compression-in-oracle-part-3-oltp-compression/

Related

ClickHouse output by columns, not by rows (transpose without transpose actually?)

Is there a way to return data from ClickHouse not by rows but by columns?
So instead of result in a following form for columns a and b
a
b
1
2
3
4
5
6
I'd get a transposed result
-
-
-
1
3
5
2
4
6
The point is I want to access data per column, eg. iterate over everything in column a.
I was checking available output formats - Arrow would do but it is not supported by my platform for now.
I'm looking for a most effective way. E.g. considered ClickHouse stores data in columns already, it does not have to process it into rows so I can transfer it back to columns using array functions afterwards. I'm not familiar with internals very much but I was wondering that I could somehow skip transposing rows if data are already in columns.
Obviously there is no easy way to do it.
And a bigger issue that it's against the SQL conception.
You can use native protocol, although you will get columns in blocks by 65k rows.
col_a 65k values, col_b 65k values, col_a next 65k values, col_b next 65k values

Data Types and Indexes

Is there some sort of performance difference for inserting, updating, or deleting data when you use the TEXT data type?
I went here and found this:
Tip: There is no performance difference among these three types, apart
from increased storage space when using the blank-padded type, and a
few extra CPU cycles to check the length when storing into a
length-constrained column. While character(n) has performance
advantages in some other database systems, there is no such advantage
in PostgreSQL; in fact character(n) is usually the slowest of the
three because of its additional storage costs. In most situations text
or character varying should be used instead.
This makes me believe there should not be a performance difference, but my friend, who is much more experienced than I am, says inserts, updates, and deletes are slower for the TEXT data type.
I had a table that was partitioned with a trigger and function, and extremely heavily indexed, but the inserts did not go that slow.
Now I have another table, with 5 more columns all of which are text data type, the same exact trigger and function, no indexes, but the inserts are terribly slow.
From my experience, I think he is correct, but what do you guys think?
Edit #1:
I am uploading the same exact data, just the 2nd version has 5 more columns.
Edit #2:
By "Slow" I mean with the first scenario, I was able to insert 500 or more rows per second, but now I can only insert 20 rows per second.
Edit #3: I didn't add the indexes to the 2nd scenario like they are in the 1st scenario because indexes are supposed to slow down inserts, updates, and deletes, from my understanding.
Edit #4: I guarantee it is exactly the same data, because I'm the one uploading it. The only difference is, the 2nd scenario has 5 additional columns, all text data type.
Edit #5: Even when I removed all of the indexes on scenario 2 and left all of them on scenario 1, the inserts were still slower on scenario 2.
Edit #6: Both scenarios have the same exact trigger and function.
Edit #7:
I am using an ETL tool, Pentaho, to insert the data, so there is no way for me to show you the code being used to insert the data.
I think I might have had too many transformation steps in the ETL tool. When I tried to insert data in the same transformation as the steps that actually transform the data, it was massively slow, but when I simply inserted the data already transformed into an empty table and then inserted data from this table into the actual table I'm using,the inserts were much faster than scenario 1 at 4000 rows per second.
The only difference between scenario 1 and scenario 2, other than the increase in columns in scenario 2, is the number of steps in the ETL transformation.Scenario two has about 20 or more steps in the ETL transformation. In some cases, there are 50 more.
I think I can solve my problem by reducing the number of transformation steps, or putting the transformed data into an empty table and then inserting the data from this table into the actual table I'm using.
PostgreSQL text and character varying are the same, with the exception of the (optional) length limit for the latter. They will perform identically.
The only reasons to prefer character varying are
you want to impose a length limit
you want to conform with the SQL standard

what is skewed column in Oracle

I found some bottleneck of my query which select data from only single table then require time and i used non unique key index on two column and with column used in where clause.
select name ,isComplete from Student where year='2015' and isComplete='F'
Now i found some concept from internet like skewed column so what is it?
have an idea then plz help me?
and how to resolve problem of skewed column?
and how skewed column affect performance of the Query?
Skewed columns are columns in which the data is not evenly distributed among the rows.
For example, suppose:
You have a table order_lines with 100,000,000 rows
The table has a column named customer_id
You have 1,000,000 distinct customers
Some (very large) customers can have hundreds of thousands or millions of order lines.
In the above example, the data in order_lines.customer_id is skewed. On average, you'd expect each distinct customer_id to have 100 order lines (100 million rows divided by 1 million distinct customers). But some large customers have many, many more than 100 order lines.
This hurts performance because Oracle bases its execution plan on statistics. So, statistically speaking, Oracle thinks it can access order_lines based on a non-unique index on customer_id and get only 100 records back, which it might then join to another table or whatever using a NESTED LOOP operation.
But, then when it actually gets 1,000,000 order lines for a particular customer, the index access and nested loop join are hideously slow. It would have been far better for Oracle to do a full table scan and hash join to the other table.
So, when there is skewed data, the optimal access plan depends on which particular customer you are selecting!
Oracle lets you avoid this problem by optionally gathering "histograms" on columns, so Oracle knows which values have lots of rows and which have only a few. That gives the Oracle optimizer the information it needs to generate the best plan in most cases.
Full table scan and Index Scan both are depend on the Skewed column.
and Skewed column is nothing but your spread like gender column contain 60 male and 40 female.

Vertica query optimization

I want to optimize a query in vertica database. I have table like this
CREATE TABLE data (a INT, b INT, c INT);
and a lot of rows in it (billions)
I fetch some data using whis query
SELECT b, c FROM data WHERE a = 1 AND b IN ( 1,2,3, ...)
but it runs slow. The query plan shows something like this
[Cost: 3M, Rows: 3B (NO STATISTICS)]
The same is shown when I perform explain on
SELECT b, c FROM data WHERE a = 1 AND b = 1
It looks like scan on some part of table. In other databases I can create an index to make such query realy fast, but what can I do in vertica?
Vertica does not have a concept of indexes. You would want to create a query specific projection using the Database Designer if this is a query that you feel is run frequently enough. Each time you create a projection, the data is physically copied and stored on disk.
I would recommend reviewing projection concepts in the documentation.
If you see a NO STATISTICS message in the plan, you can run ANALYZE_STATISTICS on the object.
For further optimization, you might want to use a JOIN rather than IN. Consider using partitions if appropriate.
Creating good projections is the "secret-sauce" of how to make Vertica perform well. Projection design is a bit of an art-form, but there are 3 fundamental concepts that you need to keep in mind:
1) SEGMENTATION: For every row, this determines which node to store the data on, based on the segmentation key. This is important for two reasons: a) DATA SKEW -- if data is heavily skewed then one node will do too much work, slowing down the entire query. b) LOCAL JOINS - if you frequently join two large fact tables, then you want the data to be segmented the same way so that the joined records exist on the same nodes. This is extremely important.
2) ORDER BY: If you are performing frequent FILTER operations in the where clause, such as in your query WHERE a=1, then consider ordering the data by this key first. Ordering will also improve GROUP BY operations. In your case, you would order the projection by columns a then b. Ordering correctly allows Vertica to perform MERGE joins instead of HASH joins which will use less memory. If you are unsure how to order the columns, then generally aim for low to high cardinality which will also improve your compression ratio significantly.
3) PARTITIONING: By partitioning your data with a column which is frequently used in the queries, such as transaction_date, etc, you allow Vertica to perform partition pruning, which reads much less data. It also helps during insert operations, allowing to only affect one small ROS container, instead of the entire file.
Here is an image which can help illustrate how these concepts work together.

oracle partitioning on columns frequently used in joins and where conditions

The customer table contains 9.5 million records. The customer_id column is the primary key. The database is Oracle.
Questions:
1) Should the table contain main partitions or sub-partitions? How do I decide?
Also, I don't think indexing columnA or columnB will help here because of the type of data.
TableA.columnA (varchar) has more than 80% of the records for columnA values 5,6,7. The columnA has values from 1 to 7 only.
TableA.columnB (varchar) has 90% of the records for columnB value = 102. The columnB has values from 1 to 999.
Moreover, the typical queries are (in no particular order):
Query1: where tableA.columnA = values
Query2: where tableA.columnB = values
Query3: where tableA.columnA = values AND/OR tableA.columnB = values
2) When we create sub-partitions, what happens if the query only contains a where clause for sub-partition column? Does the query execution go directly to sub-partition or through main partition?
3) the join contains tableA.partitioned_column = tableB.indexed_column
(eg. customer_Table.branch_code = branch_table.branch_code)
Does partitioning help in the case of JOIN? Will it improve performance?
1) It's very difficult to answer not knowing table structure, the way it's usually used etc. But generally for big tables partitioning is very often necessity.
2) If you will not specify partition then Oracle will have to browse through all partitions to find where the subpartition is (which is not very slow). And then use partition pruning on subpartition. It will be still significantly faster then not having subpartitions at all. But the best situation is to refer in WHERE to partition and subpartition.
3) For 99% I think it will help, because Oracle can use partition pruning to get at once needed rows from tableA. You will be 100% sure if you check query plan. But the best situation is when both column are partition keys.
If 80-90% of these columns have the same values and they are the most often queried values, then partitioning will help some. You would be pruning 10-20% of the data during these queries but you probably want to find another way for Oracle to hone in on the data your query needs (dates, perhaps?)
The value distribution in your two columns also brings up the point of statistics and making sure they are being gathered properly (with histograms to describe the skew in these columns).
As #psur points out, without knowing the details of your system it's hard give concrete suggestions.

Resources