Does a reverse key index help if I use an incremental sequence to insert subsequent values - Oracle

I understand the basic rationale for a reverse key index: it reduces index contention. Now if I have 3 numbers in the index: 12345, 27999, 30632, I can see that if I reverse these numbers, the next number in the sequence won't always hit the same leaf block.
But if the numbers were like 12345, 12346, 12347, then the next numbers 12348 and 12349 (incremented by 1) would hit the same leaf block even if the index is reversed:
54321, 64321, 74321, 84321, 94321.
So how is the reverse key index helping me? It was supposed to help particularly while using sequences.

If we're talking about a sequence-generated value, you can't look at 5 values and draw too many conclusions. You need to think about the data that has already been inserted and the data that will be inserted in the future.
Assuming that your sequence started at 12345, the first 5 values would be inserted sequentially. But then the sixth value will be 12350. Reverse that and you get 05321 which would go to the far left of the index. Then you'd generate 12351. Reverse that to get 15321 and that's again toward the left-hand side of the index between the first value you generated (54321) and the most recent value (05321). As the sequence generates new values, they'll go further to the right until everything resets every 10 numbers and you're inserting into the far left-hand side of the index again.
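To visualize that cycle, here's a minimal sketch (Python; note that Oracle actually reverses the key bytes rather than decimal digits, but the scattering effect is analogous):

    # Reverse the decimal digits of sequential values and watch where each
    # reversed key would land in sorted (index) order. Oracle reverses key
    # bytes, not decimal digits, but the spreading behaves the same in spirit.
    def reverse_key(n, width=5):
        return str(n).zfill(width)[::-1]

    for n in range(12345, 12356):
        print(n, "->", reverse_key(n))

    # Sorting a run of reversed keys shows them spread across the key space,
    # cycling every 10 values as the last (now leading) digit wraps around.
    print(sorted(reverse_key(n) for n in range(12345, 12356)))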

How does the leaf node split in the physical space in InnoDB?

If the keys are inserted in ascending order then, per normal B+-tree behavior, when a leaf page is full it will split and a new page will be introduced to the B+-tree.
For instance, suppose there is a leaf page holding up to 3 keys:
(page0) |1|2|3|
Then key 4 is inserted and page0 splits, becoming the parent of two new leaf pages:
(page0) |1|3|*|
(page1) |1|2|*|  |3|4|*| (page2)
After this, later keys will be inserted into page2 until the next split, since they arrive in ascending order, so all previous pages will remain half full.
In my example, I guess this will cause space to be wasted, yet that seems unreasonable for a real database, and it really confuses me. I've read Jeremy Cole's B+Tree index structures in InnoDB, but I have probably misunderstood something.
Without additional optimizations, you're absolutely correct that as an index page filled it would be split in half and then remain half-filled forever. However, InnoDB optimizes index fill based on its perception of the insertion order. That is, if it detects that insertion is being done in-order (ascending or descending) it will, instead of splitting a page in half, just create a new empty page for an insertion at the "edge" of the page.
There is some information about this in the MySQL manual section The Physical Structure of an InnoDB Index. Additionally I illustrate an example of this behavior in my post Visualizing the impact of ordered vs. random index insertion in InnoDB.
In The physical structure of InnoDB index pages I describe the Last Insert Position, Page Direction, and Number of Inserts in Page Direction fields of each index page. This is how the tracking for ascending vs. descending order is done (as left vs. right, though). With each insert, the last inserted record is compared to the currently inserted one, and if the insert is in the same "direction", the counter is incremented. This counter is then checked to determine the page split behavior; whether to split in half or create a new, empty page.
In practice, this optimization is not perfect, and there's a big difference between insertions being mostly in-order and exactly in-order. If inserts are only mostly in-order, the page direction may never get set appropriately, and pages will end up half-filled (as you described).
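As a rough illustration of the difference, here is a toy model (not InnoDB code; the page capacity, the split rules, and all names are simplified assumptions) comparing a naive 50/50 split with just opening a fresh page for in-order inserts:

    # Toy model: insert ascending keys into leaf pages of capacity 4 and
    # compare a naive 50/50 split with "open a new empty page" behavior.
    PAGE_CAPACITY = 4

    def insert_all(keys, split_in_half):
        pages = [[]]
        for k in keys:
            page = pages[-1]  # ascending keys always target the rightmost page
            if len(page) < PAGE_CAPACITY:
                page.append(k)
            elif split_in_half:
                # naive split: move the upper half of the page to a new page
                mid = PAGE_CAPACITY // 2
                pages.append(page[mid:] + [k])
                del page[mid:]
            else:
                # sequential-insert optimization: start a new, empty page
                pages.append([k])
        return pages

    for split_in_half, name in [(True, "split in half"), (False, "new empty page")]:
        pages = insert_all(range(1, 101), split_in_half)
        fill = sum(len(p) for p in pages) / (len(pages) * PAGE_CAPACITY)
        print(f"{name}: {len(pages)} pages, average fill {fill:.0%}")

With 100 ascending keys, the naive split leaves pages roughly half full, while the new-empty-page rule packs every completed page to capacity.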

Search data from a data set without reading each element

I have just started learning algorithms and data structures and I came by an interesting problem.
I need some help in solving the problem.
There is a data set given to me. Within the data set are characters, each with a number associated with it. I have to evaluate the sum of the largest numbers associated with each of the characters present. The list is not sorted by character; however, each character's entries form one contiguous group, with no further instances of that character elsewhere in the data set.
Moreover, the largest number associated with each character always appears at the last position of that character's group. We know the length of the entire data set, and we can retrieve an entry by specifying its line number in the data set.
For example:
C-7
C-9
C-12
D-1
D-8
A-3
M-67
M-78
M-90
M-91
M-92
K-4
K-7
K-10
L-13
length=15
get(3) = D-1 (returns an object with character D and value 1)
The answer for the above should be 13+10+92+3+8+12 as they are the highest numbers associated with L,K,M,A,D,C respectively.
The simplest solution is, of course, to go through all of the elements, but what is the most efficient algorithm (one reading fewer entries than the length of the data set)?
You'll have to go through them one by one, since you can't be certain what the next key will be.
For the sake of easy manipulation, I would loop over the data set and check whether the key at index i is equal to the key at index i+1; if it's not, you have a local maximum.
Then store that value in a hash or dictionary if there's not already a key:value pair for that key; if there is, check whether the existing value is less than the current value, and overwrite it if so.
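A minimal sketch of that scan, assuming the data is already in memory as a list (in the question it would come from get(i) calls), and using the guarantee that each group's maximum sits at the group's last position:

    # Mock data set; in the question these pairs would come from get(i) calls.
    data = [("C", 7), ("C", 9), ("C", 12), ("D", 1), ("D", 8), ("A", 3),
            ("M", 67), ("M", 78), ("M", 90), ("M", 91), ("M", 92),
            ("K", 4), ("K", 7), ("K", 10), ("L", 13)]

    total = 0
    for i, (ch, value) in enumerate(data):
        # A group ends where the next key differs (or at the end of the data),
        # and the problem guarantees the group's maximum sits at that position.
        if i + 1 == len(data) or data[i + 1][0] != ch:
            total += value
    print(total)  # 12 + 8 + 3 + 92 + 10 + 13 = 138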
In theory you could use statistics to optimistically skip some entries: say you read A-1, skip 5 entries, and read A-10 - good, you're still in the same group. But if you skip 5 more and read B-3, you need to go back and also read what is in between to find where the A group ended.
In reality, though, it won't work. Not on text.
That's because I/O happens in blocks. Data is stored in chunks of usually around 8 KB, so that is the minimum read size (even if your programming language provides differently sized reads, they will eventually be translated into reading and buffering whole blocks).
And how do you find the next line? Well, you read until you find a \n...
So you don't save anything on this kind of data. It would be different if you had much larger records (several KB, like files) and an index, but building that index would require reading everything at least once.
So as presented, the fastest approach would likely be to linearly scan the entire data once.

Does a bitmap index create a replica of the original table?

In this YouTube tutorial here, it seems that a bitmap index will always create a replica of the whole table when it creates the index, because it creates the index and puts a 0 or 1 against each row. Is my understanding wrong?
The other thing is that, towards the end of the tutorial, it seems a bitmap index cannot operate on a != operator.
I thought = and != would be the same from the point of view of indexing.
Every row in the table is represented by a single bit (i.e. either 0 or 1) for at least one distinct value¹. I'm not sure that could be considered a replica of the whole table, as that implies all the data is replicated, and data in other columns is obviously not present. But it does contain data for the whole table, as every row is represented (probably multiple times, all but one with the bit set to zero).
The concepts guide explains what's happening:
Each bit in the bitmap corresponds to a possible rowid. If the bit is
set, then the row with the corresponding rowid contains the key value.
A mapping function converts the bit position to an actual rowid, so
the bitmap index provides the same functionality as a B-tree index
although it uses a different internal representation.
The storage structure is also explained.
Coupled with that, when you think of it as a two-dimensional array, it becomes clearer why every row has to be represented for each value. In the example in the documentation, the value for each row has to be one of the distinct values, so a 'column' of the array has to have exactly one bit set to 1. There is no way to have an all-zero 'column' for a row that exists in the table - if the column were nullable, then null would just be another value in the array, and rows with a null in that column would have that bit set to 1 in the index - so it wouldn't make sense not to have every row represented.
You can have an array 'column' that is all zeros, but only for rows that don't exist. 'Each bit in the bitmap corresponds to a possible rowid', not necessarily to an actual row. From the storage description you can see that bitmaps are stored against ranges of rowids, and a rowid value in that range might not point to an actual row (in this table).
And that's what makes testing for inequality a problem. You can't just look at one 'row' of the array and say that anything in the 'M' row that is set to zero matches != 'M', because the rowid that bit represents might not actually be a row in the table at all. In a sense, a bit set to zero doesn't tell you anything definite; only a bit set to 1 does. So for an inequality condition, the whole index has to be checked to find values that are 1 for any other value.
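To see why a zero bit is inconclusive, here's a toy model (Python; the column values and the fixed rowid range are made-up illustrations, not Oracle's actual storage format). Rowid 2 stands for a slot with no actual row:

    # Toy bitmap index over one column, with a fixed rowid range 0..5.
    # Rowid 2 does not correspond to an actual row (say it was deleted),
    # so its bit is 0 in every bitmap.
    bitmaps = {
        "M": [1, 0, 0, 1, 0, 0],
        "F": [0, 1, 0, 0, 1, 1],
    }

    # Equality is a single bitmap scan: rows whose 'M' bit is 1.
    print([rowid for rowid, bit in enumerate(bitmaps["M"]) if bit])  # [0, 3]

    # Inequality cannot just take the zeros of the 'M' bitmap, because rowid 2
    # is 0 there without being a row at all; every other bitmap must be checked.
    print([rowid for rowid in range(6)
           if any(bm[rowid] for value, bm in bitmaps.items() if value != "M")])
    # [1, 4, 5] - rowid 2 is correctly excluded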
¹ - Logically every row is represented for every value, but the example data storage in the docs shows different rowid ranges for different values; I guess there's no point storing index data for a range where all the bits are zero, only for ranges where at least one bit is 1. But all rows are still represented in at least one index entry, as a bit set to 1 somewhere. I might be reading too much into their conceptual picture of what's stored.

Sorting application difficulty

Currently I am reading a book on algorithms and found this usage of sorting.
Reconstructing the original order - How can we restore the original arrangement of a set of items after we permute them for some application? Add an extra field to the data record for the item, such that the i-th record sets this field to i. Carry this field along whenever you move the record, and later sort on it when you want the initial order back.
I've been trying hard to understand what it means, and I failed miserably. Please, can somebody help?
Suppose you have a list of items in random order:
itemC, itemB, itemA, itemD
you sorted them up:
itemA, itemB, itemC, itemD
and you didn't have enough memory to store them in a separate location, so the original sequence is lost. Moreover, the original order is random, so it would be problematic, if not impossible, to restore.
This article gives a solution to this problem.
Add an extra field to the data record for the item, such that i-th record sets this field to i
So, we add an extra field for each of the items:
(itemC,1), (itemB,2), (itemA,3), (itemD, 4)
And after sort we have:
(itemA,3), (itemB,2), (itemC,1), (itemD, 4)
So we can easily restore the initial order by sorting on the additional field.
Let's say you have the data in an array, because it's the simplest structure that I can use to exemplify.
So, your node (i.e., element of the array) may look like this:
(some data type) data
The algorithm suggests that you add an integer field, so it looks like this:
(some data type) data,
int position
And then, you fill the positions with the actual index. Something like this pseudocode:
for current: 0 to lastElement
    array[current].position = current
(that's not written in any language I know of, but it should be readable)
After doing that, you shuffle it (resort it) for whatever you need to.
When you want to restore the original ordering, all you need to do is sort by the position field.
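A runnable version of that round trip might look like this sketch (Python; the alphabetical sort stands in for whatever permutation the application performs):

    # Tag each record with its original index before permuting it.
    items = ["itemC", "itemB", "itemA", "itemD"]
    tagged = [(item, pos) for pos, item in enumerate(items)]

    # Permute for some application; here, an alphabetical sort.
    tagged.sort(key=lambda t: t[0])
    print(tagged)  # [('itemA', 2), ('itemB', 1), ('itemC', 0), ('itemD', 3)]

    # Restore the original arrangement by sorting on the position field.
    print([item for item, pos in sorted(tagged, key=lambda t: t[1])])
    # ['itemC', 'itemB', 'itemA', 'itemD']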
Well, basically it's saying that you need some sort of marker to keep track of the original order (which is destroyed by the permutation). One option would be to simply reverse the permutation (check out Steve Jessop's informative answer here).
The other option requires fewer processing steps than inverting the permutation, but more memory. More specifically, each node in your input set gets an extra ID field, and all the elements in the input set start out sorted on this field. Once you apply the permutation, the IDs are obviously no longer in sorted order. If you wish to undo the permutation, all you have to do is sort the list on this field again.
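For comparison, the first option (explicitly reversing a recorded permutation, with no extra field stored on the records) might look like this sketch (Python; perm is assumed to record which original index ended up at each position):

    # perm[i] records which original index ended up at position i.
    original = ["itemC", "itemB", "itemA", "itemD"]
    perm = [2, 1, 0, 3]                     # e.g. alphabetical order of 'original'
    permuted = [original[i] for i in perm]  # ['itemA', 'itemB', 'itemC', 'itemD']

    # Invert the permutation: send each element back to its recorded index.
    restored = [None] * len(permuted)
    for pos, orig_index in enumerate(perm):
        restored[orig_index] = permuted[pos]
    print(restored)  # ['itemC', 'itemB', 'itemA', 'itemD']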

Key assignment scheme for sorting rows in table

I'm looking for a scheme for assigning keys to rows in a table that would allow rows to be moved around and assigned new locations in the table without having to renumber the entire table.
Something like having keys 1, 2, 3, 4, then moving row "2" between 3 and 4 and renaming it "3.5" (so you end up with 1, 3, 3.5, 4). But the scheme needs to be "infinitely" extensible, permitting at least a few thousand "random" row moves before it would normally be necessary to "normalize" the keys, and in the worst (most pathological) case allowing 25-50 such moves.
And the keys produced should be easily sorted, ideally I'd like them to be "naturally" ordered for a database (assume SQLite) query.
Any ideas?
This problem reminds me of the line-numbering problem from when people wrote code in BASIC. What most people did in that situation was take an educated guess at how many lines might be inserted between two lines, and use that guess as the spacing. So if you think you might have 2000 inserts between two elements, you might give element1 a key of 2000 and element2 a key of 4000. Then when you want to put element3 between element1 and element2, you either naively split the difference (3000) or, if you have some intuition about how many elements will go on each side of element3, weight it somewhat (i.e. 3500 instead of 3000).
Another alternative (it's really the same thing in a different numbering system) is to use floating-point numbers, which I believe you alluded to. Between 1 and 2 would be 1.5; between 1.5 and 2 would be 1.75; between 1.5 and 1.75 would be 1.625, and so on.
I would recommend against a key that is a string. It is better to stick with numeric keys, and on top of that it is probably better to have integer type keys rather than floating point type keys if you can help it.
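A minimal sketch of that gap-numbering idea with integer keys (Python; the gap size and the renumbering helper are arbitrary assumptions, not part of any particular database API):

    GAP = 2000  # spacing between freshly assigned keys (arbitrary choice)

    def key_between(low, high):
        """Integer key strictly between low and high, or None if the gap is spent."""
        mid = (low + high) // 2
        return mid if low < mid < high else None

    def renumber(count):
        """Reassign evenly spaced keys once some gap is exhausted."""
        return [GAP * (i + 1) for i in range(count)]

    keys = renumber(4)                 # [2000, 4000, 6000, 8000]
    print(key_between(6000, 8000))     # 7000: move a row between 6000 and 8000
    print(key_between(6999, 7000))     # None: gap exhausted, time to renumber
    print(renumber(4))                 # [2000, 4000, 6000, 8000]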
Conceptually, you could treat your table like a linked list. Create a table with a unique ID, the key, its next node, and whatever other data you want. Simply insert items sequentially; when you need to put a new item in between, simply swap the key values and the associated parent nodes. The key values won't remain consistent, but that is what the additional unique ID is for, and this works fine for ordering by the key as well.
Really, since the order is already specified by the key, you don't even need the 'next node'. Your scheme as described above should be fine as long as you rename the keys of the other nodes in addition to the one you moved - i.e., 2 and 3 get their key values swapped.
