Is it faster to insert a contiguous series of numbers or random numbers into a SQLite b-tree index?

Let's say we have a large dataset that has to go into an SQLite database: 250 million items. Let's say the table is
create table foo (myInt integer, name text)
and myInt is indexed but is not unique. There's no primary key.
The values are between 1 and 250000000, and duplicates are very, very rare but not impossible. That is intentional/by design.
Given the way the b-tree algorithms work (and ignoring other factors) which is the faster insert, and why?
(a) the dataset is first sorted on the myInt column (ascending or descending) and the rows are then inserted into SQLite in that pre-sorted order
(b) the dataset is inserted in a totally random order

Absolutely (a).
Random insertion into a b-tree is much slower: inserts in key order always land on the right-most leaf page, which stays in cache and fills pages sequentially, whereas random inserts touch arbitrary leaf pages all over the index, causing far more page reads, writes, and splits.
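To see the effect for yourself, here is a minimal sketch (not a rigorous benchmark; the index name foo_myInt and the scaled-down row count are mine) that loads the same keys pre-sorted and then shuffled, using Python's built-in sqlite3 module:

import random
import sqlite3
import time

N = 1_000_000  # scaled down from 250 million so the sketch runs quickly

def load(keys):
    con = sqlite3.connect(":memory:")
    con.execute("create table foo (myInt integer, name text)")
    con.execute("create index foo_myInt on foo (myInt)")
    start = time.perf_counter()
    con.executemany("insert into foo values (?, ?)", ((k, "x") for k in keys))
    con.commit()
    elapsed = time.perf_counter() - start
    con.close()
    return elapsed

sorted_keys = list(range(1, N + 1))
random_keys = sorted_keys[:]
random.shuffle(random_keys)

print("pre-sorted myInt:", load(sorted_keys), "s")
print("random myInt:    ", load(random_keys), "s")

On a disk-backed database the gap is typically even larger, because random inserts dirty far more pages than the page cache can hold.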

Related

Database index that supports arbitrary sort order

Is there database index type (or data structure in general, not just B-tree) that provides efficient enumeration of objects sorted in arbitrarily customizable order?
In order to execute a query like the one below efficiently
select *
from sample
order by column1 asc, column2 desc, column3 asc
offset :offset rows fetch next :pagesize rows only
DBMSes usually require a composite index on the fields mentioned in the "order by" clause, with the same column order and asc/desc directions, i.e.
create index c1_asc_c2_desc_c3_asc on sample(column1 asc, column2 desc, column3 asc)
The order of index columns does matter, and the index can't be used if the order of columns in the "order by" clause does not match.
To make queries with every possible "order by" clause efficient we could create indexes for every possible combination of sort columns, but this is not feasible, since the number of indexes grows at least exponentially with the number of sort columns.
Namely, if k is the number of sort columns, there are k! permutations of the sort columns and 2^k possible combinations of asc/desc directions, so the number of indexes needed is k!·2^(k-1). (We use 2^(k-1) instead of 2^k because we assume the DBMS is smart enough to scan the same index in both forward and reverse directions, but unfortunately this doesn't help much.)
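As a sanity check on that count, here is a small Python enumeration (my own illustration, not part of the original question) that counts distinct (column permutation, asc/desc) combinations, treating an index and its fully reversed scan as the same index:

from itertools import permutations, product

def indexes_needed(k):
    orderings = set()
    for perm in permutations(range(k)):
        for dirs in product((+1, -1), repeat=k):
            key = tuple(zip(perm, dirs))
            reverse = tuple((c, -d) for c, d in key)  # same index scanned backwards
            orderings.add(min(key, reverse))
    return len(orderings)

for k in range(1, 6):
    print(k, indexes_needed(k))   # 1, 4, 24, 192, 1920  ==  k! * 2**(k-1)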
So, I wish to have something like
create universal index c1_c2_c3 on sample(column1, column2, column3)
that would have the same effect as the 24 plain indexes (k = 3) that cover every possible "order by", but consume a reasonable amount of disk/memory space. By reasonable disk/memory space I mean O(k·n), where k is the number of sort columns and n is the number of rows/entries, assuming that an ordinary index consumes O(n). In other words, a universal index with k sort columns should consume approximately as much space as k ordinary indexes.
What I want looks to me like a multidimensional index, but when I googled that term I found pages that relate to either
ordinary composite indexes - this is not what I need, for the obvious reason;
spatial structures like k-d trees, quadtrees/octrees, R-trees and so on, which are more suited to nearest-neighbor search than to sorting.

Create new hash table from existing hash table

Suppose we have a hash table with 2^16 keys and values. Each key can be represented as a bit string (e.g., 0000, 0000, 0000, 0000). Now we want to construct a new hash table. The key of the new hash table is still a bit string, but with wildcards (e.g., 0000, ****, ****, ****). The corresponding value would be the average of all values in the old hash table as each * takes 0 or 1. For instance, the value of 0000, ****, ****, **** will be the average of the 2^12 values in the old hash table from 0000, 0000, 0000, 0000 to 0000, 1111, 1111, 1111. Intuitively, we need to do C(16, 4) * 2^16 operations to construct the new hash table. What's the most efficient way to construct the new hash table?
The hash table here is not helping you at all, although it isn't much of a hindrance either.
Hash tables cannot, by their nature, cluster keys by the key prefix. In order to provide good hash distribution, keys need to be distributed as close to uniformly as possible between hash values.
If you will need later to process keys in some specific ordering, you might consider an ordered associative mapping, such as a balanced binary tree or some variant of a trie. On the other hand, the advantage of processing keys in order needs to be demonstrated in order to justify the additional overhead of ordered mapping.
In this case, every key needs to be visited, which means the ordered mapping and the hash mapping will both be O(n), assuming linear time traverse and constant time processing, both reasonable assumptions. However, during the processing each result value needs two accumulated intermediaries, basically a running total and a count. (There is an algorithm for "on-line" computation of the mean of a series, but it also requires two intermediate values, running mean and count. So although it has advantages, reducing storage requirements isn't one of them.)
You can use the output hash table to store one of the intermediate values for each output value, but you need somewhere to put the other one. That might be another hash table of the same size, or something similar; in any case, there is an additional storage cost.
If you could traverse the original hash table in prefix order, you could reduce this storage cost to a constant, since the two temporary values can be recycled every time you reach a new prefix. So that's a savings, but I doubt whether it's sufficient to justify the overhead of an ordered associative mapping, which also includes increased storage requirements.
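As a concrete illustration of the single-pass idea, here is a rough Python sketch for the simple case discussed above, where the new key fixes the first 4 bits of a 16-bit key and wildcards the rest; the function and variable names are mine:

from collections import defaultdict

def build_prefix_averages(old_table, prefix_bits=4, key_bits=16):
    totals = defaultdict(float)   # running total per prefix
    counts = defaultdict(int)     # count per prefix (the second intermediary)
    for key, value in old_table.items():          # O(n): every key is visited once
        prefix = key >> (key_bits - prefix_bits)  # keep only the fixed leading bits
        totals[prefix] += value
        counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}

# Example: old table with 2**16 integer keys and arbitrary values.
old = {k: float(k % 7) for k in range(2**16)}
new = build_prefix_averages(old)
print(len(new))   # 2**4 = 16 averaged entries, each covering 2**12 old keys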

the time performance of inserting into a hash table using external chaining

Suppose I am going to insert a new element into a hash table using external chaining. If the table resizes, I know the time of the insert operation is Θ(1).
However, I don't understand why the performance is different if the bucket is of fixed size. Shouldn't it just be inserting into a linked list, which is also Θ(1)?
This is from a slide of CS61B at UC Berkeley.
The "fixed size" vs "resizing" refers to the number of buckets, rather than the size of each individual bucket.
The idea is that if we have a fixed number of buckets, say k buckets, and we insert n elements into the hash table, then with a hash function with perfect spread each bucket will hold n/k elements.
Since it would take us O(n/k) to look through all of the items in a bucket, and k is just a constant because it is fixed, our lookup time is O(n).
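A toy sketch of external chaining with a fixed bucket count (my own illustration) may make this concrete: with k buckets fixed, each chain grows with n, so a lookup scans O(n/k) = O(n) entries even though the append itself stays O(1):

class FixedChainedTable:
    def __init__(self, k=8):
        self.buckets = [[] for _ in range(k)]  # k never changes

    def insert(self, key, value):
        # Appending to the chain is O(1), as the question suggests.
        self.buckets[hash(key) % len(self.buckets)].append((key, value))

    def lookup(self, key):
        # But finding a key scans the whole chain: O(n/k) = O(n) for constant k.
        for k2, v in self.buckets[hash(key) % len(self.buckets)]:
            if k2 == key:
                return v
        return None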

Hash table: sequence always gets inserted

I have a problem related to hash tables.
Consider a hash table of size 2^n using an open-addressing scheme with the probe function
h(k,i) = (k^n + 2*i) mod (2^n). Show that the sequence
{1,2,...,2^n} can always be inserted into the hash table.
I tried to identify a pattern in the way the numbers get inserted into the table and then apply induction to see if I can prove the claim. Every problem our teacher gives us seems to be like this one, and I can't figure out a way of approaching these kinds of problems.
h(k,i) = (k^n + 2*i) mod (2^n). Show that the sequence {1,2,...,2^n} can always be inserted into the hash table.
Two observations about the hash function:
k^n, for n >= 1, will be odd when k is odd, and even when k is even
2*i will probe every second bucket (wrapping around from last to first)
So, as you hash {1,2,...,2^n}, you'll alternate between finding an unused odd-indexed bucket and an unused even-indexed bucket.
Just to emphasise the point, the k^n bit restricts the odd keys to odd-indexed buckets and the even keys to even-indexed buckets, while 2*i ensures all such buckets are considered until a free one's found. It's necessary that exactly half the keys will be odd and half even for the table to become full without h(k,i) failing to find an unused bucket as i is incremented.
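If you want to convince yourself empirically before writing the proof, a quick simulation (my own sketch) inserts 1..2^n with this probe function and checks that every key finds a free slot:

def fills_completely(n):
    size = 2 ** n
    table = [None] * size
    for k in range(1, size + 1):
        for i in range(size):                 # 2*i only ever probes every second slot
            slot = (k ** n + 2 * i) % size
            if table[slot] is None:
                table[slot] = k
                break
        else:
            return False                      # probing never found a free slot
    return all(s is not None for s in table)

print([fills_completely(n) for n in range(1, 6)])  # expect [True, True, True, True, True]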
You have a lot of terminology problems here.
Your hash table does not have "dimensions" (or rather, it has one dimension, not 2^n of them); it has a number of slots/buckets.
Most probably the question you asked is not the question your book/teacher wants you to solve. You say:
Show that the sequence {1,2,...2^n} always can be inserted into the
hash table
and the problem is that, in your case, any natural number can be inserted into your hash table. This is obvious: your hash function maps any number to a natural number in the range [0, 2^n), and because your hash table has 2^n slots, any number will fit.
So clarify what your teacher wants, find out what k and i are in your hash function, and ask another, better-prepared question.

How do hashtable indexes work?

I know about creating hashcodes, collisions, the relationship between .GetHashCode and .Equals, etc.
What I don't quite understand is how a 32-bit hash number is used to get the ~O(1) lookup. If you have an array big enough to hold every possible 32-bit value then you do get ~O(1), but that would be a waste of memory.
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehash the 32bit number to a 3 digit number and use that as lookup. When the number of elements reaches a certain threshold (say 75%) it would expand the array to something like 10K items and recompute the internal hash numbers to 4 digit numbers, based on the 32bit hash of course.
btw, here I'm using ~O(1) to account for possible collisions and their resolutions.
Do I have the gist of it correct or am I completely off the mark?
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehash the 32bit number to a 3 digit number and use that as lookup.
That's exactly what happens, except that the capacity (number of bins) of the table is more commonly set to a power of two or a prime number. The hash code is then taken modulo this number to find the bin into which to insert an item. When the capacity is a power of two, the modulus operation becomes a simple bitmasking op.
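As a tiny check of that last point (assuming non-negative hash values), taking a hash modulo a power-of-two capacity gives the same bin index as masking off the low bits:

capacity = 16                      # a power of two
for h in (0x9E3779B9, 12345, 7):   # sample non-negative hash values
    assert h % capacity == h & (capacity - 1)   # modulo and bitmask agree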
When the number of elements reaches a certain threshold (say 75%)
If you're referring to the Java Hashtable implementation, then yes. This is called the load factor. Other implementations may use 2/3 instead of 3/4.
it would expand the array to something like 10K items
In most implementations, the capacity will not be increased ten-fold but rather doubled (for power-of-two-sized hash tables) or multiplied by roughly 1.5 + the distance to the next prime number.
The hashtable has a number of bins that contain items. The number of bins is quite small to start with. Given a hashcode, it simply uses hashcode modulo bin count to find the bin in which the item should reside. That gives the fast lookup (find the bin for an item: take the hashcode modulo the bin count, done).
Or in (pseudo) code:
int hash = obj.GetHashCode();
int binIndex = hash % binCount;
// The item is in bin #binIndex. Go get the items there and find the one that matches.
Obviously, as you figured out yourself, at some point the table will need to grow. When it does, a new array of bins is created and the items in the table are redistributed to the new bins. This also means that growing a hashtable can be slow. (So, approx. O(1) in most cases, unless the insert triggers an internal resize. Lookups should always be ~O(1).)
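Putting that together, here is a compact Python sketch (mine, not the actual .NET Hashtable or Java HashMap internals) of a chained table that picks a bin by hash modulo bin count and rebuilds all bins once the load factor passes 0.75:

class SimpleHashMap:
    def __init__(self):
        self.bins = [[] for _ in range(8)]
        self.count = 0

    def _bin(self, key, bins):
        return bins[hash(key) % len(bins)]   # hash modulo bin count picks the bin

    def put(self, key, value):
        bucket = self._bin(key, self.bins)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)     # overwrite an existing key
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / len(self.bins) > 0.75:   # load factor threshold
            self._grow()

    def _grow(self):
        # Doubling resize: every stored item is redistributed, so this one
        # insert costs O(N) even though the average insert stays O(1).
        new_bins = [[] for _ in range(len(self.bins) * 2)]
        for bucket in self.bins:
            for k, v in bucket:
                self._bin(k, new_bins).append((k, v))
        self.bins = new_bins

    def get(self, key):
        for k, v in self._bin(key, self.bins):
            if k == key:
                return v
        return None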
In general, there are a number of variations in how hash tables handle overflow.
Many (including Java's, if memory serves) resize when the load factor (percentage of bins in use) exceeds some particular percentage. The downside of this is that the speed is undependable -- most insertions will be O(1), but a few will be O(N).
To ameliorate that problem, some resize gradually instead: when the load factor exceeds the magic number, they:
Create a second (larger) hash table.
Insert the new item into the new hash table.
Move some items from the existing hash table to the new one.
Then, each subsequent insertion moves another chunk from the old hash table to the new one. This retains the O(1) average complexity and can be written so that the cost of every insertion is essentially constant: when the hash table gets "full" (i.e., the load factor exceeds your trigger point) you double the size of the table, and then on each insertion you insert the new item and move one item from the old table to the new one. The old table empties at about the rate the new one fills up, so every insertion involves exactly two operations: inserting one new item and moving one old one, and insertion speed remains essentially constant.
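A rough sketch of that gradual-resize idea (my own, not taken from any particular library; it skips duplicate-key handling to stay short) keeps the old and new bin arrays side by side and moves one old item on every insert:

class IncrementalHashMap:
    def __init__(self, initial_bins=8):
        self.bins = [[] for _ in range(initial_bins)]
        self.old_bins = None     # non-None only while a migration is in progress
        self.cursor = 0          # next old bin to drain
        self.count = 0

    def _index(self, key, bins):
        return hash(key) % len(bins)

    def put(self, key, value):
        if self.old_bins is None and self.count >= 0.75 * len(self.bins):
            self.old_bins, self.cursor = self.bins, 0          # start migrating
            self.bins = [[] for _ in range(len(self.old_bins) * 2)]
        self.bins[self._index(key, self.bins)].append((key, value))
        self.count += 1
        self._migrate_one()      # amortize the rehash: one old item per insert

    def _migrate_one(self):
        if self.old_bins is None:
            return
        while self.cursor < len(self.old_bins) and not self.old_bins[self.cursor]:
            self.cursor += 1     # skip bins that are already drained
        if self.cursor == len(self.old_bins):
            self.old_bins = None # old table empty: migration finished
            return
        k, v = self.old_bins[self.cursor].pop()
        self.bins[self._index(k, self.bins)].append((k, v))

    def get(self, key):
        # While a migration is in progress, a key may live in either table.
        for bins in filter(None, (self.bins, self.old_bins)):
            for k, v in bins[self._index(key, bins)]:
                if k == key:
                    return v
        return None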
There are also other strategies. One I particularly like is to make the hash table a table of balanced trees. With this, you usually ignore overflow entirely. As the hash table fills up, you just end up with more items in each tree. In theory, this means the complexity is O(log N), but for any practical size it's proportional to log(N/M), where M = number of buckets. For practical size ranges (e.g., up to several billion items) that's essentially constant (log N grows very slowly), and it's often a little faster for the largest table you can fit in memory, and a lot faster for smaller sizes. The shortcoming is that it's only really practical when the objects you're storing are fairly large -- if you stored (for example) one character per node, the overhead from two pointers (plus, usually, balance information) per node would be extremely high.
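For that table-of-trees variant, here is a toy Python stand-in (mine); Python has no built-in balanced tree, so a sorted list kept with bisect plays the role of the per-bin tree. Lookups within a bin are O(log(N/M)), although inserts into a sorted list shift elements, unlike a real balanced tree:

import bisect

M = 1024
bins = [[] for _ in range(M)]    # each bin: list of (key, value) kept sorted by key

def put(key, value):
    b = bins[hash(key) % M]
    i = bisect.bisect_left(b, (key,))          # binary search within the bin
    if i < len(b) and b[i][0] == key:
        b[i] = (key, value)                    # overwrite an existing key
    else:
        b.insert(i, (key, value))

def get(key):
    b = bins[hash(key) % M]
    i = bisect.bisect_left(b, (key,))
    return b[i][1] if i < len(b) and b[i][0] == key else None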
