I have a boolean table with ~100000 rows and 3000 columns. I know that many of these columns are redundant because they are the result of OR between some other columns in the table (even maybe up to 200 other columns). The structure is also nested - some columns may be constructed from ORing other columns and then themselves be involved in creating a new column.
My goal is to reduce this table into "base columns" and a list of OR rules which allow me to reconstruct the original data.
Is there any way to do this in a reasonable amount of run-time?
Related
My users table doesn't have many rows... yet. š
Might the query plan of the same query change as the table grows?
I.e., to see how my application will scale, should I seed my users table with BILLIONS š¤ of rows before using EXPLAIN?
Estimated row counts are probably the most important factor that influence which query plan is chosen.
Two examples that support this:
If you use a WHERE condition on an indexed column of a table, three things can happen:
If the table is very small or a high percentage of the rows match the condition, a sequential scan will be used to read the whole table and filter out the rows that match the condition.
If the table is large and a low percentage of the rows match the condition, an index scan will be used.
If the table is large and a medium percentage of rows match the condition, a bitmap index scan will be used.
If you join two tables, the estimated row counts on the tables will determine if a nested loop join is chosen or not.
I found some bottleneck of my query which select data from only single table then require time and i used non unique key index on two column and with column used in where clause.
select name ,isComplete from Student where year='2015' and isComplete='F'
Now i found some concept from internet like skewed column so what is it?
have an idea then plz help me?
and how to resolve problem of skewed column?
and how skewed column affect performance of the Query?
Skewed columns are columns in which the data is not evenly distributed among the rows.
For example, suppose:
You have a table order_lines with 100,000,000 rows
The table has a column named customer_id
You have 1,000,000 distinct customers
Some (very large) customers can have hundreds of thousands or millions of order lines.
In the above example, the data in order_lines.customer_id is skewed. On average, you'd expect each distinct customer_id to have 100 order lines (100 million rows divided by 1 million distinct customers). But some large customers have many, many more than 100 order lines.
This hurts performance because Oracle bases its execution plan on statistics. So, statistically speaking, Oracle thinks it can access order_lines based on a non-unique index on customer_id and get only 100 records back, which it might then join to another table or whatever using a NESTED LOOP operation.
But, then when it actually gets 1,000,000 order lines for a particular customer, the index access and nested loop join are hideously slow. It would have been far better for Oracle to do a full table scan and hash join to the other table.
So, when there is skewed data, the optimal access plan depends on which particular customer you are selecting!
Oracle lets you avoid this problem by optionally gathering "histograms" on columns, so Oracle knows which values have lots of rows and which have only a few. That gives the Oracle optimizer the information it needs to generate the best plan in most cases.
Full table scan and Index Scan both are depend on the Skewed column.
and Skewed column is nothing but your spread like gender column contain 60 male and 40 female.
I would like to know if it is possible to achieve the steps below in PL / SQL.
Please note that I use the word "partition" when I mean "put rows with a certain condition together" because a) I would like to avoid the word "group" because it combines rows in SQL, b) my research so far led me to think that the "PARTITION BY" clause is possibly what I want:
1. Select rows based on a long query with many joins,
partition the results based on a certain column value of type LONG.
2. Loop through each row of a partition and partition again,
based on another column of type VARCHAR.
Do that for every partition.
3. Loop through each row of the resulting sub-partition, compare multiple columns
with predefined values, set a boolean column to true or false based on the result.
Do that for every sub-partition.
It would be really easy to do for me in a normal programming language, such as Java. But can I do that in PL/SQL? If so, what would be a good approach?
IĀ“m currently working on optimzing my database schema in regards of index structures. As IĀ“d like to increase my DDL performance IĀ“m searching for potential drop candidates on my Oracle 12c system. HereĀ“s the scenario in which I donĀ“t know what the consequences for the query performance might be if I drop the index.
Given two indexes on the same table:
- non-unique, single column index IX_A (indexes column A)
- unique, combined index UQ_AB (indexes column A, then B)
Using index monitoring I found that the query optimizer didnĀ“t choose UQ_AB, but only IX_A (probably because itĀ“s smaller and thus faster to read). As UQ_AB contains column A and additionally column B IĀ“d like to drop IX_A. Though IĀ“m not sure if I get any performance penalties if I do so. Does the higher selectivity of the combined unique index have any influence on the execution plans?
It could do, though it's quite likely to be minor (usually). Of course it depends on various things, for example how large the values in column B are.
You can look at various columns in USER_INDEXES to compare the two indexes, such as:
BLEVEL: tells you the "height" of the index tree (well, height is BLEVEL+1)
LEAF_BLOCKS: how many data blocks are occupied by the index values
DISTINCT_KEYS: how "selective" the index is
(You need to have analyzed the table first for these to be accurate). That will give you an idea of how much work Oracle needs to do to find a row using the index.
Of course the only way to really be sure is to benchmark and compare timings or even trace output.
Assumptions:
I have a number of tables comprised of facts and foreign keys ('dimensional' and 'key-value' type). For example, ENCOUNTER:
ID - primary key
dimensions
LOCATION_ID
PATIENT_ID
key-value
TYPE_ID
STATUS_ID
PATIENT_CLASS_ID
DISPOSITION_ID
...
facts
ADMISSION_DATE
DISCHARGE_DATE
...
I don't have the option to create a data warehouse
I would like to simplify the data structure for reporting
My approach is to create a number of pseudo-dimensional views ('D_LOCATION' based on the DEPARTMENT and LOCATION tables) and pseudo-fact views ('F_ENCOUNTER' based on ENCOUNTER table). In the pseudo-fact view, I would JOIN the key-value tables (e.g. STATUS, PATIENT_CLASS) to the fact table to include the name fields (e.g. STATUS.NAME, PATIENT_CLASS.NAME).
Questions:
If a query selects a subset of all of the fields from F_ENCOUNTER (i.e. not all of the key-value.name fields), is the Oracle 10g optimizer smart enough to exclude some of the key-value table joins (i.e. the ones that aren't included in the query)?
Is there anything that I can do to optimize this architecture (other than indices)
Is there another approach?
** edit **
Goals (in order of importance):
reduce query complexity; increase query consistency; decrease report-development time
optimize query-processing
minimize administrator burden
decrease storage
One optimization suggestion is not to use key-value pair tables. The concept of a Dimension table is that each record should contain all information about that concept without needing to join to normalized tables - i.e. turning a star schema into a snowflake schema.
While values might be repeated across dimension table records, it has the advantage of fewer joins in your reporting queries. Denormalizing tables in this way might seem counter intuitive but where performance is paramount it is usually the best solution.
I don't believe Oracle would exclude any joins done in the view, because the joins can impact the number of rows returned. (As when an inner join fails to match any rows, making the whole result set empty.)
What are the goals of your optimization? Query speed? query simplicity? storage efficiency? If you can sacrifice storage efficiency for better query performance, then replace the key-value references with the values themselves (TYPE_NAME instead of TYPE_ID, PATIENT_CLASS_NAME instead of PATIENT_CLASS_ID, etc.).
[Edit:] If the original architecture cannot be modified, consider using a materialized view. It would essentially pre-compute the joins and store the result set, giving you speedy query time at the cost of extra storage space and possibly-not-fresh data. You can control the latter by specifying an appropriate refresh policy. See http://en.wikipedia.org/wiki/Materialized_view and http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/mv.htm for further details.