I maintain a table in Oracle that contains several hundred thousand rows, including a priority column that indicates each row's importance according to the needs of the system.
ID | BRAND | COLOR | VALUE | SIZE | PRIORITY | EFFECTIVE_DATE_FROM | EFFECTIVE_DATE_TO
1  | BL    | BLUE  | 58345 | 12   | 1        | 10/07/2022          | NULL
2  | TK    | BLACK | 4455  | 1    | 1        | 10/07/2022          | NULL
3  | TK    | RED   | 16358 | 88   | 2        | 11/01/2022          | NULL
4  | WRA   | RED   | 98    | 10   | 6        | 18/07/2022          | NULL
5  | BL    | BLUE  | 20942 | 18   | 7        | 02/06/2022          | NULL
At any given moment thousands more rows may enter the table, and it is necessary to SELECT from it the 1000 rows with the highest priority.
The naive solution is to SELECT using ORDER BY PRIORITY ASC, but we find that this takes a long time once the table contains a very large number of rows (say over 2 million records).
One proposed solution is to split the table into two tables, so that records with priority 1 are inserted into Table A and all other records into Table B, and then to SELECT using a UNION between the two tables.
This way we avoid the ORDER BY on Table A, since it always contains priority 1, and only the data in Table B, which is expected to be much smaller than the original table, still needs sorting.
On the other hand, it was also suggested to leave the large table in place and perform the SELECT using PARTITION BY on the priority column.
I searched the web for differences in speed or efficiency between the two options but did not find any, and I am debating how to proceed. Which of the options is preferable if we focus on efficiency and time complexity?
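For concreteness, here is a minimal sketch of the two shapes being compared. The table, column and partition names are assumptions rather than the actual schema, it assumes Oracle 12c+ for the FETCH FIRST syntax, and it assumes "PARTITION BY" refers to table partitioning on the priority column:

-- Naive top-N on the single table; with an index on PRIORITY the optimizer
-- can often stop after 1000 rows instead of sorting the whole table.
SELECT *
FROM   my_priority_table
ORDER  BY priority ASC
FETCH  FIRST 1000 ROWS ONLY;

-- One-table alternative: list-partition on PRIORITY so the priority-1 rows
-- sit in their own partition and only the rest needs sorting.
CREATE TABLE my_priority_table (
    id        NUMBER,
    priority  NUMBER
    -- remaining columns from the example omitted
)
PARTITION BY LIST (priority) (
    PARTITION p_priority_1 VALUES (1),
    PARTITION p_rest       VALUES (DEFAULT)
);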
For a racing game, I have this highscore table (it contains more columns to indicate the race track and other things, but let's ignore that):
Rank | ID | Time | Datetime | Player
1 | 4ef9b | 8.470 | today 13:00 | Bob
2 | 23fcf | 8.470 | today 13:04 | Carol
3 | d8512 | 8.482 | today 12:47 | Alice
null | 0767c | 9.607 | today 12:51 | Alice
null | eec81 | 9.900 | today 12:55 | Bob
The Rank column is precomputed and reflects ORDER BY Time, Datetime but uniqued such that each player has only their best entry ranked. (The non-personal-best records, with null Rank, are relevant for historic graphs.)
When inserting a new highscore, there are at least two ways to update the ranks:
Insert the new row and invalidate their old rank, then periodically read the whole table using ORDER BY, deduplicate players, and issue UPDATE queries in batches of 1000 using long lists of CASE id WHEN 4ef9b THEN rank=1 WHEN 23fcf THEN rank=2 etc. END (a concrete sketch of such a batched update follows after the second option). The database server is clever enough that if I try setting Carol's rank to 2 again, it will see it is unchanged and not do a disk write, so it's less-than-horribly inefficient.
Only update the rows that were changed by doing:
oldID, oldRank = SELECT ID, Rank WHERE Player=$player, Rank IS NOT NULL
newRank = SELECT MAX(Rank) WHERE time < $newTime;
UPDATE Rank+=1 WHERE Rank IS NOT NULL AND Rank >= $newRank AND Rank < $oldRank
INSERT (Rank=$newRank, Time=$newTime, Player=..., etc.)
UPDATE Rank=null WHERE ID=$oldID
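A minimal sketch of one batched rewrite for the first approach, assuming MySQL on the LAMP stack, a hypothetical table name highscores, and the example IDs from the table above:

UPDATE highscores
SET `Rank` = CASE ID
                 WHEN '4ef9b' THEN 1
                 WHEN '23fcf' THEN 2
                 WHEN 'd8512' THEN 3
                 -- ... up to 1000 WHEN branches per batch
             END
WHERE ID IN ('4ef9b', '23fcf', 'd8512');
-- The WHERE clause restricts the statement to the batch, so rows outside it
-- keep their current Rank.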
I implemented the latter because it seemed to be optimally efficient (only selecting and touching rows that need changing), but this actually took the server down upon a flood of new personal best times due to new level releases. It turns out that periodically doing the less-efficient former method actually creates a lot less load, but I'd have to implement a queueing mechanism for rank invalidation which feels like extra complexity on top of inefficiency.
One problem I have is that my database calls fsync after each query (and has severe warnings against turning that off), so doing 100 row updates in one query might take 0.83 seconds, whereas doing 100 separate queries takes 100 × (fsync_time + query_time), e.g. 100 × 0.51 = 51 seconds. This might be why changing Rank rows along with every insert is such a burden on the system, so I want to batch this by storing an ordered list of (oldrank, newrank) pairs and applying them all at once.
What algorithm can be used to compute the batch update? I could select the whole ranked list from the database into a big hashmap (map[rank] = ID), apply any number of rank changes to this memory object, build big CASE WHEN strings to update a thousand rows in one query, and send those updates to the database. However, as the number of players grows, this hashmap might not fit in memory.
Is there a way to do this based on ranges of ranks, instead of individual ranks? The list of incoming changes, such as:
Bob moves from rank 500 to rank 1
Carol moves from rank 350 to rank 100
should turn into a list of changes to make for each rank:
rank[1-99] +=1
rank[100-349] += 2
rank[351-499] += 1
without having a memory object that needs O(n) memory, where n is the number of ranked scores in the database for one race track. In this case, the two changes expand to three ranges, spanning five hundred rank entries. (Changing each row in the database will still have to happen; this can probably not be helped without entirely changing the setup.)
I am using a standard LAMP stack, in case that is relevant for an answer.
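A minimal sketch, under the same assumptions (MySQL, hypothetical highscores table), of applying the computed range deltas above in one statement rather than touching ranks row by row:

UPDATE highscores
SET `Rank` = `Rank` + CASE
                          WHEN `Rank` BETWEEN 1   AND 99  THEN 1
                          WHEN `Rank` BETWEEN 100 AND 349 THEN 2
                          WHEN `Rank` BETWEEN 351 AND 499 THEN 1
                          ELSE 0
                      END
WHERE `Rank` BETWEEN 1 AND 499;
-- Bob's and Carol's own rows still get their new ranks (1 and 100) separately.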
Is there a way to return data from ClickHouse not by rows but by columns?
So instead of a result in the following form for columns a and b
a | b
1 | 2
3 | 4
5 | 6
I'd get a transposed result
- | - | -
1 | 3 | 5
2 | 4 | 6
The point is I want to access data per column, e.g. iterate over everything in column a.
I was checking the available output formats - Arrow would do, but it is not supported by my platform for now.
I'm looking for the most efficient way. E.g., considering that ClickHouse already stores data in columns, it should not have to assemble them into rows just so I can transfer them back into columns using array functions afterwards. I'm not very familiar with the internals, but I was wondering whether I could somehow skip the transposition into rows if the data is already stored as columns.
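As one possible shape for that "array functions" route (not necessarily the most efficient - the server still materializes the arrays), ClickHouse's groupArray aggregate can return each column as a single array value; a minimal sketch, assuming a table t with columns a and b:

SELECT
    groupArray(a) AS a_values,   -- [1, 3, 5]
    groupArray(b) AS b_values    -- [2, 4, 6]
FROM t;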
Obviously there is no easy way to do it.
And a bigger issue is that it goes against the way SQL is conceived.
You can use the native protocol, although you will get the columns in blocks of ~65k rows:
col_a 65k values, col_b 65k values, col_a next 65k values, col_b next 65k values
Let's say I have a Google sheet with tab1 and tab2.
In tab1 I have 2000 rows and 20 columns filled with data and 1000 rows that are empty, so I have 3000 rows in total.
In tab2 I have a few formulas like VLOOKUP and some IF functions.
The options I can think of are:
I can name the range of the data in tab1 and use that in the formula(s), and if the range expands, I can edit the named range
I can use an open-ended range like B:B
I can delete the empty rows and use B:B
What is the fastest way?
All three of those options have no real-world effect on overall performance given that you have only 3000 rows across 20 columns. The biggest performance impact comes from QUERYs, IMPORTRANGEs and ARRAYFORMULAs when they are fed a huge amount of data (10,000+ rows) or when you have extensive calculations with multiple sub-steps consisting of whole virtual arrays.
I have a huge table (more than 1 billion rows) in Impala. I need to sample ~100,000 rows several times. What is the best way to query sample rows?
As Jeff mentioned, what you've asked for exactly isn't possible yet, but we do have an internal aggregate function which takes 200,000 samples (using reservoir sampling) and returns the samples, comma-delimited as a single row. There is no way to change the number of samples yet. If there are fewer than 200,000 rows, all will be returned. If you're interested in how this works, see the implementation of the aggregate function and reservoir sampling structures.
There isn't a way to 'split' or explode the results yet, either, so I don't know how helpful this will be.
For example, sampling trivially from a table with 8 rows:
> select sample(id) from functional.alltypestiny
+------------------------+
| sample(id) |
+------------------------+
| 0, 1, 2, 3, 4, 5, 6, 7 |
+------------------------+
Fetched 1 row(s) in 4.05s
(For context: this was added in a past release to support histogram statistics in the planner, which unfortunately isn't ready yet.)
Impala does not currently support TABLESAMPLE, unfortunately. See https://issues.cloudera.org/browse/IMPALA-1924 to follow its development.
In retrospect, knowing that TABLESAMPLE is unavailable, one could add a field RVAL (a random 32-bit integer, for instance) to each record and sample repeatedly by adding WHERE RVAL > x AND RVAL < y for appropriate values of x and y. Non-overlapping intervals [x1,y1], [x2,y2], ... will be independent. You can also select using WHERE RVAL % 10000 = 1, = 2, etc. for a separate family of independent subsets.
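A minimal sketch of that approach, assuming the table name from the other answers and an RVAL column populated with uniform random 32-bit integers at load time:

-- Two non-overlapping RVAL intervals give two independent samples.
SELECT * FROM huge_table WHERE rval >= 0       AND rval < 2000000;
SELECT * FROM huge_table WHERE rval >= 2000000 AND rval < 4000000;

-- Modulo buckets give another family of disjoint, independent subsets.
SELECT * FROM huge_table WHERE rval % 10000 = 1;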
TABLESAMPLE, mentioned in other answers, is now available in newer versions of Impala (>= 2.9.0); see the documentation.
Here's an example of how you could use it to sample 1% of your data:
SELECT foo FROM huge_table TABLESAMPLE SYSTEM(1)
or
SELECT bar FROM huge_table TABLESAMPLE SYSTEM(1) WHERE name='john'
Looks like the percentage argument must be an integer, so the smallest sample you can take is 1%.
Keep in mind that the proportion of sampled data from the table is not guaranteed and may be greater than the specified percentage (in this case more than 1%). This is explained in greater detail in Impala's documentation.
If you are looking to sample over certain column(s), you can check the answer below.
Say you have global data and you want to randomly pick 10% of it to create your dataset. You can use any combination of columns too - like city, zip code and state.
select * from
(
  select
    row_number() over (partition by country order by country, random()) rn,
    count(*) over (partition by country) cntpartition,
    tab.*
  from dat.mytable tab
) rs
where rs.rn between 1 and rs.cntpartition * 10/100 -- this is for 10% of the data
Link: Randomly sampling n rows in impala using random() or tablesample system()
I had an Oracle query as below that took 10 minutes or longer to run:
select
r.range_text as duration_range,
nvl(count(c.call_duration),0) as calls,
nvl(SUM(call_duration),0) as total_duration
from
call_duration_ranges r
left join
big_table c
on c.call_duration BETWEEN r.range_lbound AND r.range_ubound
and c.aaep_src = 'MAIN_SOURCE'
and c.calltimestamp_local >= to_date('01-02-2014 00:00:00' ,'dd-MM-yyyy HH24:mi:ss')
AND c.calltimestamp_local <= to_date('28-02-2014 23:59:59','dd-MM-yyyy HH24:mi:ss')
and c.destinationnumber LIKE substr( 'abc:1301#company.com:5060;user=phone',1,8) || '%'
group by
r.range_text
order by
r.range_text
If I changed the date part of the query to:
(c.calltimestamp_local+0) >= to_date('01-02-2014 00:00:00','dd-MM-yyyy HH24:mi:ss')
AND (c.calltimestamp_local+0) <= to_date('28-02-2014 23:59:59','dd-MM-yyyy HH24:mi:ss')
It runs in 2 seconds. I did this based on another post, to avoid using the date index. It seems counter-intuitive though--the index slowing things down so much.
I ran the explain plan and it seems identical between the old and the new query. The only difference is that the MERGE JOIN operation is 16,269 bytes in the old query and 1,218 bytes in the new query. Cardinality is actually higher in the old query as well. And I don't actually see an "INDEX" operation in either plan, except for the index on the destinationnumber field.
So why is the index slowing down the query so much? And what can I do to the index--I don't think using the "+0" is the best solution going forward...
Querying for two days of data, suppressing use of destinationnumber index:
0 SELECT STATEMENT ALL_ROWS 329382 1218 14
1 SORT GROUP BY 329382 1218 14
2 MERGE JOIN OUTER 329381 1218 14
3 SORT JOIN 4 308 14
4 TABLE ACCESS FULL CALL_DURATION_RANGES ANALYZED 3 308 14
5 FILTER
6 SORT JOIN 329377 65 1
7 TABLE ACCESS BY GLOBAL INDEX ROWID BIG_TABLE ANALYZED 329376 65 1
8 INDEX RANGE SCAN IDX_CDR_CALLTIMESTAMP_LOCAL ANALYZED 1104 342104
Querying for 2 days using destinationnumber index:
0 SELECT STATEMENT ALL_ROWS 11 1218 14
1 SORT GROUP BY 11 1218 14
2 MERGE JOIN OUTER 10 1218 14
3 SORT JOIN 4 308 14
4 TABLE ACCESS FULL CALL_DURATION_RANGES ANALYZED 3 308 14
5 FILTER
6 SORT JOIN 6 65 1
7 TABLE ACCESS BY GLOBAL INDEX ROWID BIG_TABLE ANALYZED 5 65 1
8 INDEX RANGE SCAN IDX_DESTINATIONNUMBER_PART ANALYZED 4 4
Querying for one month, suppressing destinationnumber index--full scan:
0 SELECT STATEMENT ALL_ROWS 824174 1218 14
1 SORT GROUP BY 824174 1218 14
2 MERGE JOIN OUTER 824173 1218 14
3 SORT JOIN 4 308 14
4 TABLE ACCESS FULL CALL_DURATION_RANGES ANALYZED 3 308 14
5 FILTER
6 SORT JOIN 824169 65 1
7 PARTITION RANGE ALL 824168 65 1
8 TABLE ACCESS FULL BIG_TABLE ANALYZED 824168 65 1
It seems counter-intuitive though--the index slowing things down so much.
Counter-intuitive only if you don't understand how indexes work.
Indexes are good for retrieving individual rows. They are not suited to retrieving large numbers of records. You haven't provided any metrics, but it seems likely your query touches a large number of rows, in which case a full table scan or another set-based operation will be much more efficient.
Tuning date range queries is tricky, because it's very hard for the database to know how many records lie between the two bounds, no matter how up-to-date our statistics are. (Even more tricky to tune when the date bounds can vary - one day is a different matter from one month or one year.) So often we need to help the optimizer by using our knowledge of our data.
don't think using the "+0" is the best solution going forward...
Why not? People have been using that technique to avoid using an index in a specific query for literally decades.
However, there are more modern solutions. The undocumented cardinality hint is one:
select /*+ cardinality(big_table,10000) */
... should be enough to dissuade the optimizer from using an index - provided you have accurate statistics gathered for all the tables in the query.
Alternatively you can force the optimizer to do a full table scan with ...
select /*+ full(big_table) */
Anyway, there's nothing you can do to the index to change the way databases work. You could make things faster with partitioning, but I would guess if your organisation had bought the Partitioning option you'd be using it already.
These are the reasons using an index slows down a query:
A full table scan would be faster. This happens if a substantial fraction of the rows has to be retrieved. The exact numbers depend on various factors, but as a rule of thumb, in common situations using an index is slower if you retrieve more than 10-20% of the rows.
Using another index would be even better, because fewer rows are left after the first stage. Using a certain index on a table usually means that other indexes cannot be used.
Now it is the optimizer's job to decide which variant is best. To perform this task, it has to estimate (among other things) how many rows are left after applying certain filter clauses. This estimate is based on the table's statistics and is usually quite good. It even takes skewed data into account, but it might be off if your statistics are outdated or you have a rather uncommon distribution of data. For example, if you computed your statistics before the February data was inserted in your example, the optimizer might wrongly conclude that only a few rows (if any) are left after applying the date range filter.
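If stale statistics are the suspect (e.g. they were gathered before the February data arrived), re-gathering them is cheap to try; a minimal sketch, with the table name taken from the question and the schema assumed to be the current user's:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,        -- current schema; adjust as needed
    tabname => 'BIG_TABLE',
    cascade => TRUE         -- also refresh index statistics
  );
END;
/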
Using combined indexes on several columns might also be an option, depending on your data.
Another note on the skewed-data issue: there are cases where the optimizer detects skewed data in column A if you have an index on column A, but not if you only have a combined index on columns A and B, because the combination might make the distribution more even. This is one of the few cases where an index on (A,B) does not make an index on A redundant.
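For example, a combined index matching the selective predicates in the question might look like this (the index name and column order are assumptions about the data, not something prescribed by the answer):

-- Composite index: the equality/prefix predicate column first, then the date range column.
CREATE INDEX idx_bigtable_dest_ts
    ON big_table (destinationnumber, calltimestamp_local);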
APC's answer shows how to use hints to steer the optimizer in the right direction if it still produces wrong plans even with correct statistics.