Database index that supports arbitrary sort order - algorithm

Is there a database index type (or a data structure in general, not just a B-tree) that provides efficient enumeration of objects sorted in an arbitrarily customizable order?
In order to execute a query like the one below efficiently,
select *
from sample
order by column1 asc, column2 desc, column3 asc
offset :offset rows fetch next :pagesize rows only
DBMSes usually require a composite index whose columns match the "order by" clause in both column order and asc/desc direction, i.e.
create index c1_asc_c2_desc_c3_asc on sample(column1 asc, column2 desc, column3 asc)
The order of the index columns matters, and the index can't be used if the column order in the "order by" clause does not match it.
To make queries with every possible "order by" clause efficient, we could create an index for every possible combination of sort columns. But this is not feasible, since the number of indexes grows factorially with the number of sort columns.
Namely, if k is the number of sort columns, there are k! permutations of the sort columns and 2^k combinations of asc/desc directions, so the number of indexes is k!·2^(k-1). (We use 2^(k-1) instead of 2^k because we assume the DBMS is smart enough to scan the same index in both forward and reverse direction, but unfortunately this doesn't help much.)
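To see how quickly this blows up, here is a tiny back-of-the-envelope in Python (it just evaluates the formula above for small k):
import math
# plain composite indexes needed to cover every "order by" over k sort columns:
# k! column permutations times 2^(k-1) asc/desc patterns (a B-tree index can be
# scanned backwards, which halves the 2^k direction combinations)
def indexes_needed(k):
    return math.factorial(k) * 2 ** (k - 1)
for k in range(1, 6):
    print(k, indexes_needed(k))   # 1 -> 1, 2 -> 4, 3 -> 24, 4 -> 192, 5 -> 1920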
So, I wish to have something like
create universal index c1_c2_c3 on sample(column1, column2, column3)
that would have the same effect as the 24 plain indexes for k = 3 (covering every possible "order by"), but consume a reasonable amount of disk/memory space. By reasonable I mean O(k·n), where k is the number of sort columns, n is the number of rows/entries, and an ordinary index is assumed to consume O(n). In other words, a universal index over k sort columns should consume approximately as much space as k ordinary indexes.
What I want looks to me like a multidimensional index, but when I googled this term I found pages that relate to either
ordinary composite indexes - which are not what I need, for the reasons above;
spatial structures like k-d trees, quadtrees/octrees, R-trees and so on, which are suited to nearest-neighbor search rather than sorting.

Related

Persistent partitioning of the data

I am looking for the approach or algorithm that can help with the following requirements:
Partition the elements into a defined number X of partitions. The number of partitions might be redefined manually over time if needed.
Each partition should not have more than Y elements
Elements have a "category Id" and an "element Id". Ideally, all elements with the same category Id should be within the same partition; they should overflow (to as few partitions as possible) only if a given category has more than Y elements. The number of categories is orders of magnitude larger than the number of partitions.
If an element has previously been assigned to a given partition, it should continue to be assigned to the same partition.
Account for changes in the data: existing elements might be removed, and new elements can be added within each of the categories.
So far my naive approach (a rough code sketch is given after the lists below) is to:
sort the categories descending by their number of elements
keep a variable with a count-of-elements for a given partition
assign the rows from the first category to the first partition and increase the count-of-elements
if count-of-elements > Y: assign overflowing elements to the next partition, but only if the number of elements in a category is bigger than Y. Otherwise assign all elements from a given category to the next partition
continue till all elements are assigned to partitions
In order to persist the assignments, store all (element Id, partition Id) pairs in the database.
On the consecutive re-assignments:
remove from the database any elements that were deleted
assign existing elements to the partitions based on (element Id, partition Id)
for any new elements follow the above algorithm
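A rough sketch of the naive first pass described above (Python; the dict shapes are my own assumptions, and persisting the (element Id, partition Id) pairs is only hinted at in a comment):
def naive_partition(categories, Y):
    """categories: dict category_id -> list of element_ids.
    Y: maximum number of elements per partition.
    Returns dict element_id -> partition_id."""
    assignment = {}
    partition = 0
    count = 0                           # count-of-elements in the current partition
    # sort the categories descending by their number of elements
    for cat_id, elements in sorted(categories.items(),
                                   key=lambda kv: len(kv[1]), reverse=True):
        # a category that fits under Y but not in the current partition
        # is moved as a whole to the next partition
        if count + len(elements) > Y and len(elements) <= Y:
            partition += 1
            count = 0
        for el in elements:
            if count >= Y:              # only over-large categories overflow here
                partition += 1
                count = 0
            assignment[el] = partition
            count += 1
    # persist the result as (element_id, partition_id) rows in the database
    return assignment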
My main worry is that after a few such runs we will end up with categories spread across all the partitions, as the initial partitions fill up. Perhaps adding a buffer (of 20% or so) to Y might help. Also, if one of the categories sees a sudden increase in its number of elements, the partitions will need rebalancing.
Are there any existing algorithms that might help here?
This is NP-hard (knapsack) layered on NP-hard (finding the optimal way to split too-large categories) layered on the currently unknown (future data changes). Obviously the best you can do is a heuristic.
Sort the categories by descending size. Using a heap/priority queue for the partitions, put each category into the least full available partition. If the category won't fit, split it as evenly as you can across the smallest possible number of partitions. My guess (experiment!) is that trying to keep the partitions equally full works best.
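A minimal sketch of that first pass, assuming Python's heapq as the priority queue (num_partitions stands for X, Y is the per-partition capacity, and the "split as evenly as you can" rule is simplified to equal chunks over the fewest partitions that could hold the category):
import heapq
from math import ceil
def assign(categories, num_partitions, Y):
    """categories: dict category_id -> list of element_ids.
    Returns dict element_id -> partition_id."""
    # heap of (current_fill, partition_id): popping yields the least full partition
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {}
    # sort the categories by descending size
    for cat_id, elements in sorted(categories.items(),
                                   key=lambda kv: len(kv[1]), reverse=True):
        fill, p = heapq.heappop(heap)
        if fill + len(elements) <= Y:
            for el in elements:          # the whole category fits into the least full partition
                assignment[el] = p
            heapq.heappush(heap, (fill + len(elements), p))
            continue
        heapq.heappush(heap, (fill, p))
        # split as evenly as possible over the fewest partitions that could hold it
        parts = max(1, ceil(len(elements) / Y))
        chunk = ceil(len(elements) / parts)
        remaining = list(elements)
        while remaining:
            fill, p = heapq.heappop(heap)
            take = min(chunk, Y - fill, len(remaining))
            if take <= 0:                # every partition is already full; give up on the rest
                heapq.heappush(heap, (fill, p))
                break
            for el in remaining[:take]:
                assignment[el] = p
            remaining = remaining[take:]
            heapq.heappush(heap, (fill + take, p))
    return assignment
On a reassignment run, the heap fills can be rebuilt from the stored (element Id, partition Id) pairs, which also yields each category's preferred partitions for the steps below.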
On reassignment, delete the deleted elements first. Then group new elements by category. Sort the categories by how many preferred locations they have ascending, and then by descending size. Now move the categories with 0 preferred locations to the end.
For each category, if possible split its new elements across the preferred partitions, leaving them equally full. If this is not possible, put them into the emptiest possible partition. If that is not possible, then split them to put them across the fewest possible partitions.
It is, of course, possible to come up with data sets that eventually turn this into a mess. But it makes a pretty good good-faith effort to come out well.

The time performance of inserting into a hash table using external chaining

Suppose I am going to insert a new element into a hash table using external chaining. If the table resizes, I know the insert operation takes amortized Θ(1) time.
However, I don't understand why the performance is different if the bucket is of fixed size. Shouldn't it just be inserting into a linked list, which is also Θ(1)?
This is from a CS61B (UC Berkeley) slide.
The "fixed size" vs "resizing" refers to the number of buckets, rather than the size of each individual bucket.
The idea is that if we have a fixed number of buckets, let's say k buckets, and we insert n elements into the hash table, then with a hash function with perfect spread each bucket will hold n/k elements.
Since it would take us O(n/k) time to look through all of the items in a bucket, and k is just a constant because it is fixed, our lookup time is O(n).
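To make the distinction concrete, here is a tiny sketch of external chaining with a fixed number of buckets (plain Python, names invented for illustration); with k buckets fixed, the average chain length, and therefore the cost of a lookup or of an insert that has to check for duplicates, grows as n/k, i.e. linearly in n:
class FixedChainedHashTable:
    def __init__(self, k=16):
        self.buckets = [[] for _ in range(k)]   # k buckets, never resized
    def insert(self, key, value):
        # appending to the chain is O(1), but any scan of the chain averages n/k items
        self.buckets[hash(key) % len(self.buckets)].append((key, value))
    def lookup(self, key):
        for k2, v in self.buckets[hash(key) % len(self.buckets)]:
            if k2 == key:
                return v
        return None
table = FixedChainedHashTable(k=16)
for i in range(10_000):
    table.insert(i, str(i))
print(sum(len(b) for b in table.buckets) / 16)   # average chain length: 625.0 = n/k
With resizing, the table would grow the bucket array (say, double it) whenever n/k exceeds a load-factor threshold, which keeps chains at expected O(1) length and makes insert amortized Θ(1).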

Minimum sets to cover all sub arrays

I am explaining this question with a little modification so that it becomes easier for me to explain.
There are n employees and I need to arrange an outing for them on a day of the month on which all (or the most) employees are available.
Each employee is asked to fill in an online survey stating his availability, e.g. 1-31 or 15-17, and so on. Some might not be available for even a single day.
There is no restriction on the number of trips I can arrange to cover all employees (not counting those who aren't available at all during the month), but I want to find the minimum set of dates that covers all the employees. So in the worst case I will have to arrange the trip 31 times.
Question: what is the best data structure to use, and what is the best-fitting algorithm to run on it? What is the best possible way to solve this problem?
By best I of course mean time- and space-efficient, but I am also open to other ways of solving it.
The way I think of it is to maintain an array of 31 ints initialized to 0. Run over each employee and, based on their available dates, increment the corresponding array indexes. At the end, sort the array of 31 counts. The maximum value represents the date on which the most employees are available. Then apply the same logic to the left-out employees. But the problem is removing the left-out employees: for that I would have to run over the whole list of employees once more to figure out which employees can be removed and to form a new list of left-out employees on which to apply the previous logic. Running over the list twice this way just to remove employees doesn't seem optimal to me. Any ideas?
As a first step, you should exclude employees with no available dates.
Then your problem becomes a variant of the Set Cover Problem.
Your universe U is the set of all employees, and your collection of sets S corresponds to days: employee j is in set S[i] iff that employee is available on day i.
That problem is NP-hard. So, unless you settle for an approximate solution, you essentially have to check every one of the 2^31 subsets of days, probably with some pruning.
Create an array indexed 1 to 31 (each index representing a date of the month). For each date, build a (doubly) linked list containing the emp_ids of the employees available on that day. You can build each list already sorted by emp_id, and simultaneously keep track of each list's size and of the array index with the most employees.
Take the largest list as the first date (a greedy choice).
Now compare each remaining list with the selected largest list and remove from it those employees who are already present in the selected list.
Then repeat the same procedure to find the second date, and so on.
This whole procedure runs in O(n^2) time (because 31 is a constant value), and the space is O(n).
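A compact sketch of that greedy procedure (Python; plain sets stand in for the sorted doubly linked lists, which does not change the idea, and greedy set cover gives a good approximation rather than a guaranteed minimum):
def plan_trips(availability):
    """availability: dict day (1..31) -> set of emp_ids available on that day.
    Returns a list of (day, set of employees covered by that day's trip)."""
    uncovered = set().union(*availability.values())  # employees with at least one free day
    trips = []
    while uncovered:
        # greedy step: pick the day covering the most still-uncovered employees
        best_day = max(availability, key=lambda d: len(availability[d] & uncovered))
        covered = availability[best_day] & uncovered
        if not covered:          # defensive: nobody left can be covered
            break
        trips.append((best_day, covered))
        uncovered -= covered     # remove employees already covered by a chosen date
    return trips
An exact minimum would require searching over subsets of days, as the previous answer notes; the greedy choice is the standard logarithmic-factor approximation for set cover.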

Oracle selectivity / cardinality

According to this:
Selectivity is a value between 0 and 1, and it is the fraction of rows returned after applying a filter on the table. For example, if a table has 10,000 rows and the query returns 2601 rows, the selectivity would be 2601/10000, or .26, or 26 percent. Selectivity enables you (or the optimizer, for that matter) to decide which data access method is optimum in the execution plan.
I need some clarification: OK, that table has 10,000 rows and only 2601 are returned by the query. But what if the query block contains three joined tables, or a subquery in the WHERE clause? Say the FROM clause lists three tables and a fourth table appears in a WHERE-clause subquery: how is selectivity calculated then?
Selectivity = number of rows satisfying a condition (from which table?) / total number of rows (from all the four tables?)
Same question for cardinality (cardinality = selectivity * total number of rows).
I found many articles about this, but each of them exemplifies these concepts with simple select statements, based on a single table or a single where clause condition.
Can someone give me an example of how these measures are calculated for a somewhat more complex query (on the "hr" schema, or another training schema), i.e. with subqueries in the FROM clause or WHERE clause, and so on?
Thank you.
EDIT:
I need some clarification about the selectivity measure, computed by the Estimator (Cost-Based Optimizer).
http://gerardnico.com/wiki/database/oracle/selectivity
For example, for an equality predicate (last_name = 'Smith'), selectivity is set to the reciprocal of the number n of distinct values of last_name, because the query selects rows that all contain one out of n distinct values.
I don't know how to understand that "reciprocal of the number n of distinct values".
Assuming the employees table has 107 rows, and the query
select * from employees where last_name = 'Smith'
returns 2 rows, is the selectivity 2/107 ≈ 0.019? That would be the number of rows satisfying the predicate divided by the total number of rows, with no "distinct" involved in the equation.
Apart from this selectivity of the statement, there is also a column selectivity, which is represented by the NDV (the number of distinct values in that column, which can be queried from dba_tab_col_statistics) divided by the total number of rows (http://www.runningoracle.com/product_info.php?products_id=233). So if the NDV is 103, the selectivity of the last_name column is 103/107 ≈ 0.96.
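Putting rough numbers on this (a back-of-the-envelope in Python using the figures above; 1/NDV is the estimate the optimizer makes before execution, while rows-returned/total-rows is the actual selectivity known only afterwards):
total_rows = 107            # rows in employees
ndv_last_name = 103         # distinct values of last_name (from dba_tab_col_statistics)
actual_matches = 2          # rows actually returned for last_name = 'Smith'
estimated_selectivity = 1 / ndv_last_name                   # ~0.0097, the optimizer's guess
estimated_cardinality = total_rows * estimated_selectivity  # ~1.04, i.e. "about 1 row"
actual_selectivity = actual_matches / total_rows            # ~0.019
print(estimated_selectivity, estimated_cardinality, actual_selectivity)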
This is what I understood; is it correct? If I'm wrong, please correct me.
Thank you.
Selectivity is always based on whatever criteria are being applied at that time.
Exactly what this means for a particular table depends on the join order.
Oracle will always start executing a query by selecting rows from one particular table on its own. In this case the selectivity is straightforward, as in the examples you have read; there are no join conditions to take into account at this point.
Next, it joins in a second table. Oracle estimates how many rows will satisfy both the constant conditions on that table alone and any join conditions back to the first table. The latter is called "join selectivity".
Then, when joining the third table, it estimates based on the join to the first two tables as well as any constant conditions on the third table.
This is one of the reasons that join order is so important to a plan.
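As a rough illustration of how the estimates compose across a join (a common textbook approximation rather than Oracle's exact internal formulas; all numbers below are made up):
# assumed toy numbers, not taken from any real schema
card_t1 = 100             # rows expected from T1 after its own filter predicates
card_t2 = 5000            # rows expected from T2 after its own filter predicates
ndv_t1_join_col = 50      # distinct values in T1's join column
ndv_t2_join_col = 200     # distinct values in T2's join column
# textbook join selectivity: reciprocal of the larger join-column NDV
join_selectivity = 1 / max(ndv_t1_join_col, ndv_t2_join_col)      # 1/200 = 0.005
estimated_join_rows = card_t1 * card_t2 * join_selectivity        # 100 * 5000 * 0.005 = 2500
print(join_selectivity, estimated_join_rows)
The 2500-row estimate would then feed the next join in the chosen join order, which is why join order matters so much to the plan.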

Is it faster to insert a contiguous series of numbers or random numbers into a SQLite btree-index

Let's say we have a large dataset, 250 million items, that has to go into a SQLite database. Let's say the table is
create table foo (myInt integer, name text)
and myInt is indexed but is not unique. There's no primary key.
The values are between 1 and 250000000 and duplicates are very very rare but not impossible. That is intentional/by design.
Given the way B-tree algorithms work (and ignoring other factors), which is the faster insert, and why?
(a) the dataset is first sorted on the myInt column (ascending or descending) and the rows are then inserted into SQLite in their pre-sorted order
(b) the dataset is inserted in a totally random order
Absolutely (a).
Random insertion into a B-tree is much slower: sorted input keeps appending to the same rightmost leaf pages, which stay hot in the cache, while random input dirties pages and causes splits all over the tree.
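One quick way to check this empirically is with Python's built-in sqlite3 module (a sketch: the index name and the much smaller row count are my own choices, an in-memory database keeps the comparison about B-tree work rather than disk flushes, and absolute timings depend on page cache, journal mode and hardware):
import random, sqlite3, time
def timed_insert(rows, label):
    con = sqlite3.connect(":memory:")
    con.execute("create table foo (myInt integer, name text)")
    con.execute("create index idx_foo_myint on foo(myInt)")  # non-unique index, no primary key
    start = time.perf_counter()
    con.executemany("insert into foo values (?, ?)", rows)
    con.commit()
    print(label, round(time.perf_counter() - start, 3), "s")
    con.close()
n = 1_000_000                        # scaled down from 250 million for a quick test
data = [(i, "name%d" % i) for i in range(1, n + 1)]
timed_insert(data, "sorted")         # (a) pre-sorted on myInt
random.shuffle(data)
timed_insert(data, "shuffled")       # (b) random order
Each run does all its inserts inside a single transaction; that matters far more for absolute speed than insert order, but the sorted-vs-shuffled gap should still show up.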
