In this YouTube tutorial, it seems that a bitmap index always creates a replica of the whole table when the index is built, because it puts a 0 or 1 against each row. Is my understanding wrong?
The other thing is that towards the end of the tutorial, it seems that a bitmap index cannot be used with the != operator. From the point of view of indexing, = and != seem the same to me.
Every row in the table is represented by a single bit (i.e. either 0 or 1) for at least one distinct value [1]. I'm not sure that could be considered a replica of the whole table, as that implies all the data is replicated, and data in other columns is obviously not present. But it does contain data for the whole table, as every row is represented (probably multiple times, all but one with the bit set to zero).
The concepts guide explains what's happening:
Each bit in the bitmap corresponds to a possible rowid. If the bit is set, then the row with the corresponding rowid contains the key value. A mapping function converts the bit position to an actual rowid, so the bitmap index provides the same functionality as a B-tree index although it uses a different internal representation.
The storage structure is also explained.
Coupled with that, when you think of it as a two-dimensional array, it becomes clearer why every row has to be represented for each value. In the example in the documentation, the value of each row has to be one of the distinct values, so a 'column' of the array has to have exactly one bit set to 1. For an actual row in the table, there is no way to have a 'column' that is all zeros - if the indexed column were nullable, then null would simply be another value in the array, and rows with a null would have that bit set to 1 in the index. So it wouldn't make sense not to have every row represented.
You can have an array 'column' that is all zeros, but only for rows that don't exist. 'Each bit in the bitmap corresponds to a possible rowid', not necessarily to an actual row. From the storage description you can see that bitmaps are stored against ranges of rowids, and a rowid value in that range might not point to an actual row (in this table).
And that's what makes testing for inequality a problem. You can't just look at one 'row' of the array and say that anything in the 'M' row that is set to zero matches != 'M', because the rowid that bit represents might not actually be a row in the table at all. In a sense, a bit set to zero doesn't tell you anything definite; only a bit set to 1 does. So for an inequality condition, the whole index has to be checked to find values that are 1 for any other value.
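Here is a toy Python model of that point (purely illustrative - not Oracle's actual storage format): flipping the 'M' bitmap to answer != produces phantom rowids, while OR-ing the bitmaps of the other values does not.

    # Toy model: one bitmap per distinct value; each bit position stands
    # for a *possible* rowid in the range 0..5, not necessarily a real row.
    bitmaps = {
        "M": [1, 0, 0, 1, 0, 0],
        "F": [0, 1, 0, 0, 0, 0],
    }
    existing_rows = {0, 1, 3}   # rowids 2, 4 and 5 don't point to rows here

    # WHERE sex = 'M': every bit set to 1 is a definite hit.
    eq = {rid for rid, bit in enumerate(bitmaps["M"]) if bit}
    print(eq)                        # {0, 3}

    # WHERE sex != 'M': just flipping the 'M' bitmap is wrong, because a
    # 0 bit may correspond to a rowid that isn't a row at all.
    flipped = {rid for rid, bit in enumerate(bitmaps["M"]) if not bit}
    print(flipped - existing_rows)   # {2, 4, 5} - phantom rowids

    # The correct result ORs together the bitmaps of all *other* values,
    # so only bits definitely set to 1 somewhere contribute.
    ne = {rid for value, bm in bitmaps.items() if value != "M"
              for rid, bit in enumerate(bm) if bit}
    print(ne)                        # {1}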
[1] Logically every row is represented for every value, but the example data storage in the docs shows different rowid ranges for different values; I guess there's no point storing index data for a range where all the bits are zero, only for ranges where at least one bit is 1. But all rows are still represented in at least one index entry, as a bit set to 1 somewhere. I might be reading too much into their conceptual picture of what's stored.
I understood the basic rationale for a reverse key index: it reduces index contention. Now if I have 3 numbers in the index - 12345, 27999, 30632 - I can see that if I reverse these numbers, the next number in the sequence won't always hit the same leaf block.
But if the numbers were like 12345, 12346, 12347, then the next numbers 12348, 12349 (incremented by 1) would hit the same leaf block even if the index is reversed: 54321, 64321, 74321, 84321, 94321.
So how is the reverse key index helping me? It was supposed to help particularly when using sequences.
If we're talking about a sequence-generated value, you can't look at 5 values and draw too many conclusions. You need to think about the data that has already been inserted and the data that will be inserted in the future.
Assuming that your sequence started at 12345, the first 5 values would be inserted sequentially. But then the sixth value will be 12350. Reverse that and you get 05321 which would go to the far left of the index. Then you'd generate 12351. Reverse that to get 15321 and that's again toward the left-hand side of the index between the first value you generated (54321) and the most recent value (05321). As the sequence generates new values, they'll go further to the right until everything resets every 10 numbers and you're inserting into the far left-hand side of the index again.
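A quick Python simulation of that behaviour (digit-level reversal for readability; Oracle actually reverses the key bytes, but the clustering effect is the same idea):

    def reverse_key(n, width=5):
        """Reverse the (zero-padded) digits of n, as in the examples above."""
        return int(str(n).zfill(width)[::-1])

    for n in range(12345, 12353):
        print(n, "->", reverse_key(n))
    # 12345 -> 54321
    # 12346 -> 64321
    # ...
    # 12349 -> 94321
    # 12350 -> 5321    <- jumps to the far left of the index
    # 12351 -> 15321
    # 12352 -> 25321

So at any moment the ten most recently generated keys are spread across ten widely separated regions of the index, which is exactly what relieves the contention on the right-most leaf block that plain sequence values would cause.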
I've been having some difficulty scaling up the application and decided to ask a question here.
Consider a relational database (say MySQL). Let's say it allows users to make posts, and these are stored in the post table (fields: postid, posterid, data, timestamp). So, when you go to retrieve all posts by you sorted by recency, you simply get all posts with posterid = you and order by timestamp. Simple enough.
This query will use the index on timestamp, since it has the highest cardinality, and correctly so. So, beyond looking into the index, it takes literally one row fetch from disk to complete this task. Awesome!
But let's say there have been 1 million more posts (in the system) by other users since you last posted. Then, in order to get your latest post, the database will hit the index on timestamp again, and it's not like it knows how many posts have happened since then (or should we at least manually estimate and set a preferred key?). We've then wasted looking through a million and one index entries just to fetch a single row.
Additionally, one of the use cases is a set of posts from multiple arbitrary users, so I cannot make a field like userid_timestamp to create a sub-index.
Am I seeing this wrong? Or what must be changed fundamentally from the application to allow such operation to occur at least somewhat efficiently?
Indexing
If you have a query: ... WHERE posterid = you ORDER BY timestamp [DESC], then you need a composite index on {posterid, timestamp}.
Finding all posts of a given user is done by a range scan on the index's leading edge (posterid).
Finding a user's oldest/newest post can be done in a single index seek, whose cost is proportional to the B-Tree height, which is proportional to log(N), where N is the number of indexed rows.
To understand why, take a look at Anatomy of an SQL Index.
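To see why this works, here is a rough Python model of the composite index as a sorted list of (posterid, timestamp) pairs (a stand-in for the B-Tree; real index pages are more involved than this):

    import bisect

    # The composite index: entries kept sorted by (posterid, timestamp),
    # just like the B-Tree would keep them.
    index = sorted([(5, 100), (7, 90), (5, 180), (2, 300), (5, 160)])

    def posts_of(user):
        """Range scan on the leading edge: seek to the first entry for
        `user`, then read sequentially while the leading column matches."""
        i = bisect.bisect_left(index, (user,))
        out = []
        while i < len(index) and index[i][0] == user:
            out.append(index[i])   # already in timestamp order - no sort step
            i += 1
        return out

    print(posts_of(5))       # [(5, 100), (5, 160), (5, 180)]
    print(posts_of(5)[-1])   # newest post, found without touching other users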
Clustering
The leaves of a "normal" B-Tree index hold "pointers" (physical addresses) to the indexed rows, while the rows themselves reside in a separate data structure called the "table heap". The heap can be eliminated by storing rows directly in the leaves of the B-Tree, which is called clustering. This has its pros and cons, but if you have one predominant kind of query, eliminating the table heap access through clustering is definitely something to consider.
In this particular case, the table could be created like this:
CREATE TABLE T (
    posterid INT,
    `timestamp` DATETIME,
    data VARCHAR(50),
    PRIMARY KEY (posterid, `timestamp`)
);
MySQL/InnoDB clusters all its tables and uses the primary key as the clustering key. We haven't used a surrogate key (postid) since secondary indexes in clustered tables can be expensive and we already have a natural key. If you really need the surrogate key, consider making it an alternate key and keeping the clustering established through the natural key.
For queries like
where posterid = 5
order by timestamp
or
where posterid in (4, 578, 222299, ...etc...)
order by timestamp
make an index on (posterid, timestamp) and the database should pick it all by itself.
Edit - I just tried this with MySQL:
CREATE TABLE `posts` (
    `id` INT(11) NOT NULL,
    `ts` INT NOT NULL,
    `data` VARCHAR(100) NULL DEFAULT NULL,
    INDEX `id_ts` (`id`, `ts`),
    INDEX `id` (`id`),
    INDEX `ts` (`ts`),
    INDEX `ts_id` (`ts`, `id`)
)
ENGINE=InnoDB
I filled it with a lot of data, and
explain
select * from posts where id = 5 order by ts
picks the id_ts index.
Assuming you use hash tables to implement your database - yes. Hash tables are not ordered, and you have no other way but to iterate over all elements in order to find the maximum.
However, if you use some ordered data structure, such as a B+ tree (which is actually pretty well optimized for disks and thus databases), it is a different story.
You can store elements in your B+ tree ordered by user (primary comparator) and date (secondary comparator, descending). Once you have this structure, finding the first element matching the primary criterion (user id) - and thus the user's latest post - can be achieved in O(log(n)) disk seeks.
I am not familiar with the internals of specific databases, but AFAIK some of them do allow you to create an index based on a B+ tree - and by doing so, you can find the last post of a user more efficiently.
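A small Python contrast of the two cases (a dict standing in for the hash table, a sorted list standing in for the B+ tree; names are made up):

    import bisect

    posts = {101: ("alice", 500), 102: ("bob", 900), 103: ("alice", 700)}

    # Hash table: unordered, so finding alice's newest post scans everything.
    newest = max(ts for user, ts in posts.values() if user == "alice")  # 700

    # Ordered structure keyed by (user, -timestamp): alice's newest post is
    # the *first* entry for alice, reachable by one O(log n) search.
    tree = sorted((user, -ts, pid) for pid, (user, ts) in posts.items())
    i = bisect.bisect_left(tree, ("alice",))
    user, neg_ts, pid = tree[i]
    print(pid, -neg_ts)   # 103 700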
P.S.
To be exact, the concept of a "greatest" element, or ordering in general, is not well defined in relational algebra: strict relational algebra has no max operator and no sort operator (though SQL has both). To get the max element of a table R with a single column a, one actually has to build the Cartesian product of the table with itself and subtract every row that is smaller than some other row.
(Assuming set, and not multiset semantics):
MAX = R \ Project(Select(R x R, R1.a < R2.a),R1.a)
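A quick sanity check of that expression in Python, with sets standing in for relations (R has a single column, so rows are plain values):

    R = {3, 1, 4, 5, 2}

    # Select(R x R, R1.a < R2.a) projected on R1.a: every value smaller
    # than some other value, i.e. everything except the maximum.
    dominated = {r1 for r1 in R for r2 in R if r1 < r2}

    MAX = R - dominated
    print(MAX)   # {5}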
Our team is working on the implementation of a table widget for a mobile platform (one of the applications is a mobile office suite like MS Excel).
We need to optimize the data structure used for storing the table data (currently a simple 2-D array is used).
Could you please suggest an optimal data structure for storing table data? Here are some of the requirements for it:
the size of the table can be up to 2^32 x 2^32;
the majority of table cells are empty (i.e. the table is sparse), so it is desirable not to store data for empty cells;
the interface of the data structure should support inserting/removing rows and columns;
the data structure should allow iterating through non-empty cells in both forward and backward directions;
cells of the table can be merged (i.e. one cell can span more than one row and/or column).
After thinking more about the problem with the row/column insertion/deletion, I've come up with something that looks promising.
First, create and maintain 2 sorted data structures (e.g. search trees) containing all horizontal and all vertical indices that have at least one non-empty cell.
For this table:
  A B C D E
1
2 *
3   %     #
4
5       $
You'd have:
A,B,D,E - used horizontal indices
2,3,5 - used vertical indices
Store those A,B,D,E,2,3,5 index values inside some kind of a node in the 2 aforementioned structures such that you can link something to it knowing the node's address in memory (again, a tree node fits perfectly).
In each cell (non-empty) have a pair of links to the index nodes describing its location (I'm using & to denote a link/reference to a node):
*: &2,&A
%: &3,&B
#: &3,&E
$: &5,&D
This is sufficient to define a table.
Now, how do we handle row/column insertion? We insert the new row/column index into the respective (horizontal or vertical) index data structure and update the index values after it (i.e. to the right or below). Then we add new cells for this new row/column (if any) and link them to the appropriate index nodes.
For example, let's insert a row between rows 3 and 4 and add a cell with # in it at 4C (in the new row):
  A B C D E
1
2 *
3   %     #
4     #        <- new row 4
5              <- used to be row 4
6       $      <- used to be row 5
Your index structures are now:
A,B,C(new),D,E - used horizontal indices
2,3,4(new),6(used to be 5) - used vertical indices
The cells now link to the index nodes like this:
*: &2,&A - same as before
%: &3,&B - same as before
#: &3,&E - same as before
#: &4,&C - new cell linking to new index nodes 4 and C
$: &6,&D - used to be &5,&D
But look at the $ cell. It still points to the same two physical nodes as before, it's just that the vertical/row node now contains index 6 instead of index 5.
If there were 100 cell nodes below the $ cell, occupying say only 5 non-empty rows, you'd need to update only 5 indices in the row/vertical index data structure, not 100.
You can delete rows and columns in a similar fashion.
Now, to make this all useful, you also need to be able to locate every cell by its coordinates.
For that you can create another sorted data structure (again, possibly a search tree), where every key is a combination of the addresses of the index nodes and the value is the location of cell data (or the cell data itself).
With that, if you want to get to cell 3B, you find the nodes for 3 and B in the index data structures, take their addresses &3 and &B, combine them into &3*2^32+&B and use that as a key to locate the % cell in the 3rd data structure I've just defined. (Note: 2^32 is actually 2^(pointer size in bits) and can vary from system to system.)
Whatever happens to other cells, the addresses &3 and &B in the %'s cell links will remain the same, even if the indices of the % cell change from 3B to something else.
You may develop iteration on top of this easily.
Merging should be feasible too, but I haven't focused on it.
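Here is a compact Python sketch of the scheme (lists and dicts stand in for the search trees, and Python object identity stands in for node addresses; all names are made up for illustration):

    class IndexNode:
        """A used row or column index. Cells link to this object, not to
        the number itself, so renumbering touches only the node."""
        def __init__(self, position):
            self.position = position

    class SparseTable:
        def __init__(self):
            self.rows = []    # sorted list of IndexNode (stand-in for a tree)
            self.cols = []
            self.cells = {}   # (id(row_node), id(col_node)) -> value

        def _node(self, nodes, pos):
            # Find the index node for `pos`, creating it if needed.
            for n in nodes:
                if n.position == pos:
                    return n
            node = IndexNode(pos)
            nodes.append(node)
            nodes.sort(key=lambda n: n.position)
            return node

        def set(self, row, col, value):
            r, c = self._node(self.rows, row), self._node(self.cols, col)
            self.cells[(id(r), id(c))] = value

        def get(self, row, col):
            r = next((n for n in self.rows if n.position == row), None)
            c = next((n for n in self.cols if n.position == col), None)
            return self.cells.get((id(r), id(c))) if r and c else None

        def insert_row(self, before):
            # Shift only the *used* row indices at or after `before`:
            # cost is O(used rows), not O(non-empty cells).
            for n in self.rows:
                if n.position >= before:
                    n.position += 1

    t = SparseTable()
    t.set(3, 2, '%')    # row 3, column B
    t.set(5, 4, '$')    # row 5, column D
    t.insert_row(4)     # insert a row between 3 and 4
    print(t.get(6, 4))  # '$' - the cell moved to row 6 without being touched

A real implementation would use balanced search trees in place of the lists, and a third tree keyed on the combined node addresses for the coordinate lookup, as described above.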
I would suggest just storing key-value pairs like you would in Excel. For example, think of your Excel document as having columns A - AA etc. and rows 1 - 256000 etc. So just store the values that actually have data as some type of key-value pairs.
For example:
someKeyValueStore = new KeyValueStore();
someData = new Cell("A1", "SomeValue");
someOtherData = new Cell("C2", "SomeOtherValue");
someKeyValueStore.AddKeyValuePair(someData);
someKeyValueStore.AddKeyValuePair(someOtherData);
In this case you don't have to care about empty cells at all. You just have access to the ones that are not empty. Of course you probably would want to keep track of the keys in a collection so you could easily see if you had a value for a particular key or not. But that is essentially the simplest way to handle it.
I'm looking for a scheme for assigning keys to rows in a table that would allow rows to be moved around and assigned new locations in the table without having to renumber the entire table.
Something like having keys 1, 2, 3, 4, then moving row "2" between 3 and 4 and renaming it "3.5" (so you end up with 1, 3, 3.5, 4). But the scheme needs to be "infinitely" extensible (permitting at least a few thousand "random" row moves before it would normally be necessary to "normalize" the keys, and in the worst (most pathological) case allowing 25-50 such moves).
And the keys produced should be easily sorted, ideally I'd like them to be "naturally" ordered for a database (assume SQLite) query.
Any ideas?
This problem reminds me of the line numbering problem when writing code in BASIC. What most people did in that situation was take an educated guess at how many lines might be inserted between two lines; that guess became the spacing between those lines. So if you think you might have 2000 inserts between two elements, you might give element1 a key of 2000 and element2 a key of 4000. Then when you want to put element3 between element1 and element2, you either naively split the difference (3000), or if you have some intuition about how many elements will go on each side of element3, you might weight it some (i.e. 3500 instead of 3000).
Another alternative (it's really just the same thing, but using a different numbering system) is to use floating point numbers, which I believe you alluded to. Between 1 and 2 would be 1.5; between 1.5 and 2 would be 1.75; between 1.5 and 1.75 would be 1.625, etc.
I would recommend against a key that is a string. It is better to stick with numeric keys, and on top of that it is probably better to have integer keys rather than floating point keys if you can help it. A sketch of the integer-gap approach follows.
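A minimal Python sketch of that idea (names and the gap size are made up):

    GAP = 2000   # initial spacing between adjacent keys

    def initial_keys(n):
        """Assign keys with large gaps so rows can be moved between them."""
        return [GAP * (i + 1) for i in range(n)]

    def key_between(lo, hi):
        """Integer midpoint of two keys; None means the gap is exhausted
        and the table needs renumbering ('normalizing')."""
        mid = (lo + hi) // 2
        return mid if lo < mid < hi else None

    keys = initial_keys(4)                 # [2000, 4000, 6000, 8000]
    moved = key_between(keys[1], keys[2])  # move a row between #2 and #3 -> 5000

Note that repeated moves into the same spot halve the gap each time, so a gap of 2000 survives only about 11 consecutive pathological moves; size the gap (or renormalize) according to the 25-50 worst-case moves you want to tolerate.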
Conceptually, you could treat your table like a linked list. Create a table with a unique ID, the key, its next node, and whatever other data you want. Simply insert items sequentially; when you need to put a new item in between, swap the key values and the associated parent nodes. The key values won't remain consistent, but that is what the additional unique ID is for, and this works fine for ordering by the key as well.
Really, since you have order already specified by the key, you don't even need the 'next node'. Your scheme as described above should be fine as long as you rename the keys of the other nodes in addition to the one you moved - i.e., 2 and 3 get their key values swapped.
I'm currently doing some data loading for a kind of warehouse solution. I get a data export from production each night, which then must be loaded. There are no other updates on the warehouse tables. To load only the new items for a certain table, I'm currently doing the following steps:
get the current max value y for a specific column (id for journal tables and time for event tables)
load the data via a query like where x > y
To avoid performance issues (I load around 1 million rows per day) I removed most indices from the tables (they are only needed in production, not in the warehouse). But that way the retrieval of the max value takes some time... so my question is:
What is the best way to get the current max value of a column without an index on that column? I just read about using the stats, but I don't know how to handle columns of type 'timestamp with time zone'. Disabling the index before the load and recreating it afterwards takes much too long...
The minimum and maximum values that are computed as part of column-level statistics are estimates. The optimizer only needs them to be reasonably close, not completely accurate. I certainly wouldn't trust them as part of a load process.
Loading a million rows per day isn't terribly much. Do you have an extremely small load window? I'm a bit hard-pressed to believe that you can't afford the cost of the one index you'd need to do a min/max index scan.
If you want to avoid indexes, however, you probably want to store the last max value in a separate table that you maintain as part of the load process. After you load rows 1-1000 in table A, you'd update the row in this summary table for table A to indicate that the last row you've processed is row 1000. The next time in, you would read the value from the summary table and start at 1001.
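A sketch of that in Python, assuming a DB-API connection and made-up table names (a load_watermarks control table tracking a journal source table):

    def incremental_load(conn):
        cur = conn.cursor()

        # Read the high-water mark recorded by the previous load.
        cur.execute("SELECT last_max_id FROM load_watermarks"
                    " WHERE table_name = 'journal'")
        last_max = cur.fetchone()[0]

        # Fetch only the new rows - no MAX() scan over the big table.
        cur.execute("SELECT id, payload FROM journal WHERE id > %s ORDER BY id",
                    (last_max,))
        rows = cur.fetchall()

        # ... insert `rows` into the warehouse table here ...

        if rows:
            # Persist the new high-water mark for the next run, in the same
            # transaction as the load so the two can never drift apart.
            cur.execute("UPDATE load_watermarks SET last_max_id = %s"
                        " WHERE table_name = 'journal'", (rows[-1][0],))
        conn.commit()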
If there is no index on the column, the only way for the DBMS to find the maximum value in the column is a complete table scan, which takes a long time for large tables.
I suppose a DBMS could try to keep track of the minimum and maximum values in the column (storing the values in the system catalog) as it does inserts, updates and deletes - but deletes are why no DBMS I know of tries to keep statistics up to date with per-row operations. If you delete the maximum value, finding the new maximum requires a table scan if the column is not indexed (and if it is indexed, the index makes it trivial to find the maximum value, so the information does not have to be stored in the system catalog). This is why they're called 'statistics'; they're an approximation to the values that apply. But when you request 'SELECT MAX(somecol) FROM sometable', you aren't asking for statistical maximum; you're asking for the actual current maximum.
Have the process that creates the extract file also extract a single-row file with the min/max you want. I assume that piece is scripted on some cron or scheduler, so it shouldn't be too much to ask to add min/max calcs to that script ;)
If not, just do a full scan. A million rows isn't much really, especially in a data warehouse environment.
This code was written for Oracle; rownum is Oracle-specific, so other databases will need their equivalent (LIMIT or FETCH FIRST). It gets the key of the max(high_val) in the table for the selected range:
select high_val, my_key
from (select high_val, my_key
      from mytable
      where something = 'avalue'
      order by high_val desc)
where rownum <= 1
What this says is: sort mytable by high_val descending for rows where something = 'avalue', and grab only the top row, which gives you the max(high_val) in the selected range along with its my_key.