Optimal data structure for a table - data-structures

Our team is implementing a table widget for a mobile platform (one of the applications is a mobile office suite, similar to MS Excel).
We need to optimize the data structure that stores the table data (currently a simple 2-D array).
Could you please suggest the optimal data structure for storing table data? Below are some of the requirements for the data structure:
the size of the table can be up to 2^32 x 2^32;
the majority of table cells are empty (i.e. the table is sparse), so it is desirable not to store data for empty cells;
interface of the data structure should support inserting/removing rows and columns;
data structure should allow to iterate through non-empty cells in forward and backward direction;
cells of the table can be merged (i.e. one cell can span more than one row and/or column).

After thinking more about the problem with the row/column insertion/deletion, I've come up with something that looks promising.
First, create and maintain 2 sorted data structures (e.g. search trees) containing all horizontal and all vertical indices that have at least one non-empty cell.
For this table:
  ABCDE
1
2 *
3  %  #
4
5    $
You'd have:
A,B,D,E - used horizontal indices
2,3,5 - used vertical indices
Store those A,B,D,E,2,3,5 index values inside some kind of a node in the 2 aforementioned structures such that you can link something to it knowing the node's address in memory (again, a tree node fits perfectly).
In each cell (non-empty) have a pair of links to the index nodes describing its location (I'm using & to denote a link/reference to a node):
*: &2,&A
%: &3,&B
#: &3,&E
$: &5,&D
This is sufficient to define a table.
Now, how do we handle row/column insertion? We insert the new row/column index into the respective (horizontal or vertical) index data structure and update the index values after it (=to the right or below). Then we add new cells for this new row/column (if any) and link them to the appropriate index nodes.
For example, let's insert a row between rows 3 and 4 and add a cell with # in it at 4C (in the new row):
  ABCDE
1
2 *
3  %  #
4   #     <- new row 4
5         <- used to be row 4
6    $    <- used to be row 5
Your index structures are now:
A,B,C(new),D,E - used horizontal indices
2,3,4(new),6(used to be 5) - used vertical indices
The cells now link to the index nodes like this:
*: &2,&A - same as before
%: &3,&B - same as before
#: &3,&E - same as before
#: &4,&C - new cell linking to new index nodes 4 and C
$: &6,&D - used to be &5,&D
But look at the $ cell. It still points to the same two physical nodes as before, it's just that the vertical/row node now contains index 6 instead of index 5.
If there were 100 cell nodes below the $ cell, say occupying only 5 non-empty rows, you'd need to update only 5 indices in the row/vertical index data structure, not 100.
You can delete rows and columns in a similar fashion.
Now, to make this all useful, you also need to be able to locate every cell by its coordinates.
For that you can create another sorted data structure (again, possibly a search tree), where every key is a combination of the addresses of the index nodes and the value is the location of cell data (or the cell data itself).
With that, if you want to get to cell 3B, you find the nodes for 3 and B in the index data structures, take their addresses &3 and &B, combine them into &3*2^32+&B and use that as a key to locate the % cell in the 3rd data structure I've just defined. (Note: 2^32 is actually 2^(pointer size in bits) and can vary from system to system.)
Whatever happens to other cells, the addresses &3 and &B in the %'s cell links will remain the same, even if the indices of the % cell change from 3B to something else.
You may develop iteration on top of this easily.
Merging should be feasible too, but I haven't focused on it.
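The scheme above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the names IndexNode, SparseTable, set, get and insert_row are made up for this example, sorted Python lists stand in for the search trees (so lookups here are linear, not logarithmic), and a dictionary keyed by the pair of node objects plays the role of the third structure keyed by node addresses.

```python
import bisect

class IndexNode:
    """One used row or column index. Cells refer to the node, not to the number."""
    def __init__(self, index):
        self.index = index

class SparseTable:
    def __init__(self):
        self.rows = []   # IndexNode objects, kept sorted by index (stand-in for a tree)
        self.cols = []
        self.cells = {}  # (row node, column node) -> value, hashed by node identity

    @staticmethod
    def _find(nodes, index, create=False):
        # Rebuilding the key list is O(n); a real search tree would avoid this.
        keys = [n.index for n in nodes]
        pos = bisect.bisect_left(keys, index)
        if pos < len(nodes) and nodes[pos].index == index:
            return nodes[pos]
        if not create:
            return None
        node = IndexNode(index)
        nodes.insert(pos, node)
        return node

    def set(self, row, col, value):
        r = self._find(self.rows, row, create=True)
        c = self._find(self.cols, col, create=True)
        self.cells[(r, c)] = value

    def get(self, row, col):
        r = self._find(self.rows, row)
        c = self._find(self.cols, col)
        if r is None or c is None:
            return None
        return self.cells.get((r, c))

    def insert_row(self, before):
        """Open an empty row at index `before`; only used-row nodes are renumbered."""
        for node in self.rows:
            if node.index >= before:
                node.index += 1

# The example from the text: insert a row before row 4, then add a cell at 4C.
t = SparseTable()
t.set(2, "A", "*")
t.set(3, "B", "%")
t.set(3, "E", "#")
t.set(5, "D", "$")
t.insert_row(4)
t.set(4, "C", "#")
print(t.get(6, "D"))  # the $ cell, found under its new row index
```

The $ cell is never touched by insert_row; only the row node it refers to changes its stored index, which is the point of the indirection.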

I would suggest just storing key-value pairs, as you would in Excel. For example, think of your Excel document as having columns A - AA etc. and rows 1 - 256000 etc. Just store the cells that have data in some type of key-value store.
For example:
someKeyValueStore = new KeyValueStore();
someData = new Cell("A1", "SomeValue");
someOtherData = new Cell("C2", "SomeOtherValue");
someKeyValueStore.AddKeyValuePair(someData);
someKeyValueStore.AddKeyValuePair(someOtherData);
In this case you don't have to care about empty cells at all. You just have access to the ones that are not empty. Of course you probably would want to keep track of the keys in a collection so you could easily see if you had a value for a particular key or not. But that is essentially the simplest way to handle it.
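As a rough sketch of the same idea in Python (the class and method names here are invented for illustration and are not from any library), a dictionary already tracks its keys, so checking whether a cell has a value needs no separate collection:

```python
class KeyValueStore:
    """Sparse cell storage: only non-empty cells take up space."""
    def __init__(self):
        self._cells = {}  # cell name such as "A1" -> value

    def add(self, key, value):
        self._cells[key] = value

    def has(self, key):
        return key in self._cells

    def get(self, key, default=None):
        return self._cells.get(key, default)

    def remove(self, key):
        self._cells.pop(key, None)

    def __len__(self):
        return len(self._cells)

store = KeyValueStore()
store.add("A1", "SomeValue")
store.add("C2", "SomeOtherValue")
print(store.has("A1"), store.has("B7"))
```

Note that a plain hash map like this does not by itself help with row/column insertion or ordered iteration; those requirements need extra structure on top.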

Related

Best DataStructure to implement Excel Spreadsheet

How can we implement an Excel spreadsheet that supports creation and deletion of rows, creation and deletion of cells, and modification of the data inside any cell?
I am looking for the best data structure to implement this.
The problem statement is a little vague in my opinion. We do not have any information about which operations will be most frequent, or even about the amount of data this data structure is going to hold.
So let's assume there can be a fair amount of data, and that the operations are addition and deletion of rows and cells.
For an Excel spreadsheet, if I had to implement it with a custom data structure, I would make each row a node of a linked list. This is helpful because, as opposed to an (n-dimensional) array, the memory can be allocated in a non-contiguous manner. It also makes adding and deleting rows much easier.
Inside each node, we can have an array of strings to hold the cell values and an Id field to hold the Id of the row.
The head node of the data structure will have the column names as the values of its string array, so each column is mapped to an index of the array.
To add a row: It will be an insert into the linked list. Make a new row and append it at the end.
To delete a row: Same as deletion of a node in a linked list.
To add/update a cell value: You know the row Id and the column name, so you can look up the index of the column in the array from the head node. Once you have the node corresponding to the row, access that index of its string array to add/read/update/delete the value of the cell.
To optimize node access, you can keep indexes on the linked list to easily locate a node by row Id. A further optimization would be to store a row-Id-to-node-pointer mapping in an auxiliary map or array, so that inserting rows in between is also fast.
However, I would reiterate that the implementation should be chosen based on the use case. If there are heavy column addition/deletion operations, for example, this will be quite slow. There are different trade-offs for each kind of use case.
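A minimal sketch of the linked-list layout described above might look like this. All names (RowNode, Sheet, by_id and so on) are hypothetical, the list is singly linked, and row deletion is left out to keep it short:

```python
class RowNode:
    def __init__(self, row_id, width):
        self.row_id = row_id
        self.cells = [None] * width  # one slot per column
        self.next = None

class Sheet:
    def __init__(self, columns):
        self.columns = list(columns)  # plays the role of the head node's column names
        self.head = None
        self.tail = None
        self.by_id = {}               # auxiliary map: row Id -> node

    def append_row(self, row_id):
        node = RowNode(row_id, len(self.columns))
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node
        self.by_id[row_id] = node

    def insert_after(self, existing_id, row_id):
        prev = self.by_id[existing_id]  # O(1) thanks to the auxiliary map
        node = RowNode(row_id, len(self.columns))
        node.next = prev.next
        prev.next = node
        if prev is self.tail:
            self.tail = node
        self.by_id[row_id] = node

    def set_cell(self, row_id, column, value):
        self.by_id[row_id].cells[self.columns.index(column)] = value

    def row_ids(self):
        out, node = [], self.head
        while node is not None:
            out.append(node.row_id)
            node = node.next
        return out

s = Sheet(["name", "qty"])
s.append_row(1)
s.append_row(2)
s.insert_after(1, 3)        # new row between rows 1 and 2
s.set_cell(3, "qty", "7")
print(s.row_ids())
```

Adding or removing a column in this layout means touching the cell array of every row, which is the slow case mentioned above.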
I think the easy way to go ahead with this is to simply use a JSON structure to hold each row, with column names as keys and cell values as values. This handles null/empty values quite easily.
A spreadsheet is essentially similar to a table; changes can be made to any cell in any row. Hence going with a simple list structure would not be too bad. The downside is that deletion and insertion of rows in the middle is not performant. But insertion of rows at the end, which is the most common use case, and modification of cells can be made quite easy.
A linked list structure would make insertion and deletion faster, but it would affect random access adversely, so a simple list of JSON objects would be the better choice.
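For illustration, here is what a list of JSON-like row objects could look like in Python. The column names and values are made up for this example:

```python
import json

# Each row is a dict: column name -> cell value. Empty cells are simply absent.
rows = [
    {"name": "Ada", "age": 36},
    {"name": "Linus"},
]

rows.append({"name": "Grace", "age": 45})  # append at the end: cheap
rows[1]["age"] = 28                        # random access by row position: cheap
rows.insert(1, {"name": "Alan"})           # insert in the middle: shifts later rows
del rows[0]                                # delete at the front: also shifts rows

print(json.dumps(rows))
```

Reading an empty cell with rows[i].get(column) returns None, so empty values need no special handling.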

Create two or three columns spreadsheet having multiple rows?

I understand that table and cells are supported only in PDFClown version 2.0 but that is only a few months away. So, being stuck with version 1.2, how do I create a spreadsheet having 2 columns (& another spreadsheet having 3 columns)?
Anything with examples to point me in the right direction.
As you noticed, the layout engine supporting tables and lots of other high-level typographic elements is scheduled for 0.2.0 (its Java implementation will be pre-released for evaluation and beta-testing); in the meantime you can coarsely arrange a table this way:
1. define the table partition (columns) on the page and draw the corresponding rectangles through the PrimitiveComposer;
2. insert in each column area your contents through BlockComposer, keeping track of the maximum y occupied by your contents (this is calculated when you call BlockComposer.End(), after which you can retrieve the bounding box of your contents via BlockComposer.BoundBox);
3. when you complete the columns for the current table row, use the maximum y saved in step 2 to draw the bottom line which closes the row and iterate back to step 2 until you run out of rows;
4. if you run out of space while inserting contents, keep track of the positions returned by BlockComposer.ShowText() and BlockComposer.ShowXObject(): this way you can fill each column, then move to the next page and resume the insertion according to the tracked positions.
This should suffice to get the job done. ;-)

Does a bitmap index create a replica of the original table?

In this youtube tutorial here
it seems that a bitmap index will always create a replica of the whole table when it creates the index, because it creates the index and puts 0 or 1 against each row. Is my understanding wrong?
The other thing is that towards the end of the tutorial it seems that a bitmap index cannot operate on a != operator.
From the point of view of indexing, = and != seem the same to me.
Every row in the table is represented by a single bit (i.e. either 0 or 1), for at least one distinct value [1]. I'm not sure that could be considered a replica of the whole table, as that implies that all the data is replicated, and data in other columns is obviously not present. But it does contain data for the whole table, as every row is represented (probably multiple times, all but one with the bit set to zero).
The concepts guide explains what's happening:
Each bit in the bitmap corresponds to a possible rowid. If the bit is
set, then the row with the corresponding rowid contains the key value.
A mapping function converts the bit position to an actual rowid, so
the bitmap index provides the same functionality as a B-tree index
although it uses a different internal representation.
The storage structure is also explained.
Coupled with that, when you think of it as a two-dimensional array, it becomes clearer why every row has to be represented for each value. In the example in the documentation, the value for each row has to be one of the distinct values, so a 'column' of the array has to have exactly one bit set to 1. There is no way for a row in the table to have a 'column' that is all zeros (if the column was nullable then null would be another value in the array, and null columns in the table would have that bit set to 1 in the index), so it wouldn't make sense not to have every row represented.
You can have an array 'column' that is all zeros, but only for rows that don't exist. 'Each bit in the bitmap corresponds to a possible rowid', not necessarily to an actual row. From the storage description you can see that bitmaps are stored against ranges of rowids, and a rowid value in that range might not point to an actual row (in this table).
And that's what makes testing for inequality a problem. You can't just look at one 'row' of the array and say that anything in the 'M' row that is set to zero matches != 'M', because the rowid that bit represents might not actually be a row in the table at all. In a sense, a bit set to zero doesn't tell you anything definite; only a bit set to 1 does. So for an inequality condition, the whole index has to be checked to find values that are 1 for any other value.
[1] Logically every row is represented for every value, but the example data storage in the docs shows different rowid ranges for different values; I guess there's no point storing index data for a range where all the bits are zero, only for ranges where at least one bit is 1. But all rows are still represented in at least one index entry, as a bit set to 1 somewhere. I might be reading too much into their conceptual picture of what's stored.
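The inequality argument can be illustrated with a toy model. This is purely conceptual and not how any database actually stores its bitmaps; the rowid range, the values and the variable names are all invented for the example:

```python
# A range of 6 possible rowids; slot 3 does not hold a row of this table.
N = 6
rows = {0: "M", 1: "F", 2: "M", 4: "F", 5: "M"}  # rowid -> indexed value

# One bitmap per distinct value, one bit per possible rowid.
bitmaps = {}
for rowid, value in rows.items():
    bitmaps.setdefault(value, [0] * N)[rowid] = 1

# Equality: read the bits set to 1 in a single bitmap.
equal_m = [i for i, bit in enumerate(bitmaps["M"]) if bit]

# Naive inequality: treat every 0 bit in the 'M' bitmap as a match.
naive_not_m = [i for i, bit in enumerate(bitmaps["M"]) if not bit]

# Safer inequality: combine the 1 bits of all the other values' bitmaps.
not_m = [i for i in range(N)
         if any(bm[i] for value, bm in bitmaps.items() if value != "M")]

print(equal_m, naive_not_m, not_m)
```

The naive version reports rowid 3, which is not a row at all, while the version that scans the other bitmaps returns only real rows.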

Representing 2D data optimized by row vs by column vs flat

In D3 I need to visualize loading lab samples into plastic 2D plates of 8 rows x 12 columns or similar. Sometimes I load a row at a time, sometimes a column at a time, occasionally I load flat 1D 0..95, or other orderings. Should the base D3 data() structure nest rows in columns (or vice versa) or should I keep it one dimensional?
Representing the data optimized for columns, columns[rows[]], makes code complex when loading by rows, and vice versa. Representing it flat [0..95] is universal but it requires calculating all row and column references for 2D modes. I'd rather reference all orderings out of a common base but so far it's a win-lose proposition. I lean toward 1D flat and doing the math. Is there a win-win? Is there a way to parameterize or invert the ordering and have it optimized for both ways?
I believe in your case the best implementation would be an Associative array (specifically, a hash table implementation of it). Keys would be coordinates and values would be your stored data. Depending on your programming language you would need to handle keys in one way or another.
Example:
[0,0,0] -> someData(1,2,3,4,5)
[0,0,1] -> someData(4,2,4,6,2)
[0,0,2] -> someData(2,3,2,1,5)
Using a simple associative array would give you great insertion speeds and reading speeds, however code would become a mess if some complex selection of data blocks is required. In that case, using some database could be reasonable (though slower than a hashmap implementation of associative array). It would allow you to query some specific data in batches. For example, you could get whole row (or several rows) of data using one simple query:
SELECT * FROM data WHERE x=1 AND y=2 ORDER BY z ASC
Or, let's say selecting a 2x2x2 cube from the middle of 3d data:
SELECT * FROM data WHERE x>=5 AND x<=6 AND y>=10 AND y<=11 AND z>=3 AND z<=4 ORDER BY x ASC, y ASC, z ASC
EDIT:
On second thought, if the size of the dimensions won't change during runtime, you should go with a 1-dimensional array and do all the math yourself, as it is the fastest solution. If you initialize a 3-dimensional array as an array of arrays of arrays, every read/write to an element requires 2 additional hops in memory to find the required address. However, a function like:
int pos(int w, int h, int x, int y, int z) { return z*w*h + y*w + x; } // w, h: dimensions; x, y, z: position
would be inlined by most compilers and be pretty fast.
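Applied to the 8 x 12 plate from the question, the flat approach could look something like the sketch below. The helper names are made up, and row-major storage is an arbitrary choice; the point is that any loading order is just a different list of flat indices into the same array:

```python
ROWS, COLS = 8, 12

def to_flat(row, col):
    """Row-major flat index of a well."""
    return row * COLS + col

def to_row_col(i):
    """Inverse of to_flat."""
    return divmod(i, COLS)

def row_order():
    """Flat indices in row-at-a-time loading order."""
    return list(range(ROWS * COLS))

def column_order():
    """Flat indices in column-at-a-time loading order."""
    return [to_flat(r, c) for c in range(COLS) for r in range(ROWS)]

wells = [None] * (ROWS * COLS)        # the single flat base structure
for step, i in enumerate(column_order()):
    wells[i] = step                   # record the order in which each well was loaded

print(to_row_col(95), column_order()[:3])
```

Other orderings (snake patterns, quadrants and so on) would be further functions returning a permutation of 0..95.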

Key assignment scheme for sorting rows in table

I'm looking for a scheme for assigning keys to rows in a table that would allow rows to be moved around and assigned new locations in the table without having to renumber the entire table.
Something like having keys 1, 2, 3, 4, then moving row "2" between 3 and 4 and renaming it "3.5" (so you end up with 1, 3, 3.5, 4). But the scheme needs to be "infinitely" extensible (permitting at least a few thousand "random" row moves before it would normally be necessary to "normalize" the keys, and allowing 25-50 such moves in the worst, most pathological case).
And the keys produced should be easily sorted, ideally I'd like them to be "naturally" ordered for a database (assume SQLite) query.
Any ideas?
This problem reminds me of the line numbering problem people had when writing code in BASIC. What most people did in this situation was take an educated guess at how many lines might be inserted between two lines, and use that guess as the spacing between those lines. So if you think you might have 2000 inserts between two elements, then you might give element1 a key of 2000 and element2 a key of 4000. Then when you want to put an element3 between element1 and element2, you either naively split the difference (3000) or, if you have some intuition about how many elements would go on each side of element3, you might weight it some (e.g. 3500 instead of 3000).
Another alternative (it's really just the same thing with a different numbering system) is to use floating point numbers, which I believe you alluded to. Between 1 and 2 would be 1.5. Between 1.5 and 2 would be 1.75. Between 1.5 and 1.75 would be 1.625, etc.
I would recommend against a key that is a string. It is better to stick with numeric keys, and on top of that it is probably better to have integer type keys rather than floating point type keys if you can help it.
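A small sketch of the gapped integer key idea follows. The gap size of 2^32 and the function names are assumptions chosen for this example. With plain midpoint splitting, a gap of 2^k survives k moves into the same ever-shrinking interval, so this gap covers the 25-50 pathological moves asked for while keys still fit in a 64-bit integer column:

```python
GAP = 1 << 32  # assumed initial spacing between neighbouring keys

def initial_keys(n):
    """Keys for n rows, spaced GAP apart."""
    return [(i + 1) * GAP for i in range(n)]

def key_between(lo, hi):
    """Integer key strictly between lo and hi, or None if the gap is used up."""
    if hi - lo < 2:
        return None  # time to renumber ("normalize") the keys
    return lo + (hi - lo) // 2

keys = initial_keys(4)
moved = key_between(keys[2], keys[3])  # move a row between the 3rd and 4th rows

# Worst case: keep inserting directly after the same row.
lo, hi, splits = 0, GAP, 0
while (mid := key_between(lo, hi)) is not None:
    hi = mid
    splits += 1
print(moved, splits)
```

When key_between returns None, the keys around that spot (or the whole table) have to be renumbered with fresh gaps.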
Conceptually, you could treat your table like a linked list. Create a table with a unique ID, the key, its next node and whatever other data you want. Simply insert items sequentially; when you need to put a new item in between, swap the key values and the associated parent nodes. The key values won't remain consistent, but that is what the additional unique ID is for, and this works fine for ordering by the key as well.
Really, since you have order already specified by the key, you don't even need the 'next node'. Your scheme as described above should be fine as long as you rename the keys of the other nodes in addition to the one you moved - i.e., 2 and 3 get their key values swapped.

Resources