Usually ‘expandable’ grids are represented as a list of lists (a list of rows, where each row holds a list of cells), and those lists are some kind of linked list.
Manipulating rows in this data structure (removing, inserting) is easy and inexpensive, since it's just a matter of re-linking the neighbouring nodes. But when it comes to columns, removing a column for instance becomes a very long operation: I need to loop over all the rows to remove the cell at that index. Clearly this isn't good behavior, at least for my case.
I'm not talking about databases here; a good example I've found is a text file in a text editor. As far as I know, text editors mostly split the text into lines, which makes it easy to remove a line. I want removing a column to be as inexpensive and efficient as removing a row.
Finally, what I need is some multi-dimensional grid, but I think any simple 2D grid approach would also be applicable to more dimensions. Am I right?
You could have a two dimensional "linked matrix" (I forget the proper terminology):
...  Col 3  ...  Col 4  ...
       |           |
...  --X--  ...  --Y--  ...
       |           |
...   ...   ...   ...   ...
Each cell has four neighbours, as shown. Additionally you need row and column headers that might indicate the row/column position, as well as pointing to the first cell in each row or column. These are most easily represented as special cells without an up neighbour (for column headers).
Inserting a new column between 3 and 4 means iterating down the cells X in col 3, and inserting a new right neighbour Z. This new cell Z links leftward to X and rightward to Y. You also need to add a new column header, and link the new cells vertically. Then the positions of all the columns after 4 can be renumbered (col 4 becomes col 5).
...  Col 3  Col 4  Col 5  ...
       |      |      |
...  --X------Z------Y--  ...
       |      |      |
...   ...    ...    ...   ...
The cost of inserting a column is O(n) for inserting and linking the new cells (n being the number of rows), plus O(m) for updating the column headers (m being the number of columns). It's a similar process for deletion.
Because each cell is just four links, the same algorithms are used for row insertion/deletion.
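To make the structure concrete, here is a minimal C++ sketch of such a four-way linked cell and the column-insertion walk; the names are mine, and the row/column headers (and their renumbering) are omitted for brevity:

struct Cell {
    Cell *up = nullptr, *down = nullptr, *left = nullptr, *right = nullptr;
    int value = 0;
};

// Walk down the cells X of a column, inserting a new right neighbour Z
// for each; the caller is responsible for the new column header and for
// renumbering the headers to the right.
void insertColumnAfter(Cell *colTop) {
    Cell *above = nullptr;                 // previously inserted cell, for vertical links
    for (Cell *x = colTop; x != nullptr; x = x->down) {
        Cell *z = new Cell;
        Cell *y = x->right;                // may be null at the right edge
        z->left = x;
        z->right = y;
        x->right = z;
        if (y) y->left = z;
        z->up = above;
        if (above) above->down = z;
        above = z;
    }
}

The same walk with up/down and left/right swapped inserts a row.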
Keep your existing data structure as is. In addition, give each column a unique id when it is created. When you delete the column, just add its id to a hash table of all deleted column ids. Every time you walk a row, check each element's column id (which needs to be stored along with all other data for an element) against the hash table and splice it out of the row if it has been deleted.
The hash table and ids are unnecessary if you have a per-column data structure that each grid element can point to. Then you just need a deleted bit in that data structure.
By the way, Edmund's scheme would be fine for you as well. Even though it takes O(n) time to delete a row or column of length n, you can presumably amortize that cost against the cost of creating those n elements, making the delete O(1) amortized time.
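A rough sketch of that lazy splice, where the names, the int payload, and the use of std::unordered_set are my own choices:

#include <unordered_set>

struct Elem {
    Elem *next;
    long columnId;   // assigned from a counter when the column is created
    int value;
};

std::unordered_set<long> deletedColumns;   // "deleting" a column just inserts its id here

// Walk one row, splicing out elements of deleted columns as we go.
void walkRow(Elem *&head) {
    Elem **link = &head;
    while (*link) {
        if (deletedColumns.count((*link)->columnId)) {
            Elem *dead = *link;
            *link = dead->next;   // splice the stale cell out of the row
            delete dead;
        } else {
            // ...visit (*link)->value here...
            link = &(*link)->next;
        }
    }
}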
I know that "Linked-Lists" are usually appreciated from a theoretical point of view, but in practice they are generally inefficient.
I would suggest moving toward Random Access Containers to get some speed. The most simple would be an array, but a double-ended queue or an indexed skip list / B* tree could be better, depending on the data size we are talking about.
Conceptually, it doesn't change much (yet); however, you gain the ability to move to a given index in O(1) (array, deque) or O(log N) (skip list / B* tree) operations, rather than O(N) with a simple linked list.
And then it's time for magic.
Keith has already outlined the basic idea: rather than actually deleting the column, you just mark it as deleted and then 'jump' over it when you walk your structure. However, a hash table still requires a linear walk to get to the Nth column. Using a Fenwick tree would yield an efficient way to compute the real (physical) index, and you could then jump directly there.
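A minimal sketch of such a Fenwick (binary indexed) tree, under my own naming: each physical column contributes an "alive" bit, and kth() finds the physical slot of the k-th live column in O(log n):

#include <vector>

struct Fenwick {
    std::vector<int> t;   // 1-based prefix-sum tree over "alive" bits

    explicit Fenwick(int n) : t(n + 1, 0) {
        for (int i = 1; i <= n; ++i) add(i, 1);   // all n physical columns start alive
    }
    void add(int i, int d) {
        for (; i < (int)t.size(); i += i & -i) t[i] += d;
    }
    void markDeleted(int physical) { add(physical, -1); }

    // Physical index of the k-th live column (both 1-based), found by a
    // logarithmic descent over the implicit tree.
    int kth(int k) const {
        int pos = 0, n = (int)t.size() - 1;
        int step = 1;
        while ((step << 1) <= n) step <<= 1;
        for (; step > 0; step >>= 1) {
            if (pos + step <= n && t[pos + step] < k) {
                pos += step;
                k -= t[pos];
            }
        }
        return pos + 1;
    }
};

Accessing logical column c is then kth(c), and deleting it is markDeleted(kth(c)), both O(log n).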
Note that a key benefit of marking a row or column as deleted is the obvious possibility of an undo operation.
Also note that you might want to build a compacting function, to eliminate the deleted columns from time to time, and not let them accumulate.
I am looking for an approach or algorithm that can help with the following requirements:
Partition the elements into a defined number X of partitions. The number of partitions might be redefined manually over time if needed.
Each partition should not have more than Y elements
Elements have a "category Id" and an "element Id". Ideally, all elements with the same category Id should be within the same partition; they should overflow to as few partitions as possible, and only if the given category has more than Y elements. The number of categories is orders of magnitude larger than the number of partitions.
If an element has previously been assigned to a given partition, it should continue being assigned to the same partition.
Account for changes in the data: existing elements might be removed, and new elements can be added within each of the categories.
So far my naive approach is to:
sort the categories descending by their number of elements
keep a variable with a count-of-elements for a given partition
assign the elements from the first category to the first partition and increase the count-of-elements
if count-of-elements > Y: assign the overflowing elements to the next partition, but only if the category has more than Y elements; otherwise assign all elements of the category to the next partition
continue till all elements are assigned to partitions
In order to persist the assignments, store all pairs (element Id, partition Id) in the database
On consecutive re-assignments:
remove from the database any elements that were deleted
assign existing elements to the partitions based on (element Id, partition Id)
for any new elements follow the above algorithm
My main worry is that after a few such runs we will end up with categories spread all across the partitions, as the initial partitions fill up. Perhaps adding a buffer (of 20% or so) to Y might help. Also, if one of the categories sees a sudden increase in its number of elements, the partitions will need rebalancing.
Are there any existing algorithms that might help here?
This is NP-hard (knapsack) stacked on NP-hard (finding the optimal way to split over-large categories) stacked on the currently unknowable (future data changes). Obviously the best that you can do is a heuristic.
Sort the categories by descending size. Using a heap/priority queue for the partitions, put each category into the least full available partition. If the category won't fit, then split it as evenly as you can into the smallest number of possible partitions. My guess (experiment!) is that trying to leave partitions at the same fill is best.
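A sketch of that first pass in C++, with one simplification of my own: an oversized or non-fitting category spills greedily into the next-emptiest partitions, which approximates "split as evenly as you can across the fewest partitions":

#include <algorithm>
#include <queue>
#include <vector>

struct Partition { int id; int fill; };
struct ByFill {   // min-heap: the least full partition is on top
    bool operator()(const Partition &a, const Partition &b) const {
        return a.fill > b.fill;
    }
};

// categorySizes: element count per category; X partitions of capacity Y.
// Returns the resulting fill per partition (real code would also record
// the (element Id, partition Id) pairs for persistence).
std::vector<int> assignCategories(std::vector<int> categorySizes, int X, int Y) {
    std::sort(categorySizes.rbegin(), categorySizes.rend());   // descending size
    std::priority_queue<Partition, std::vector<Partition>, ByFill> heap;
    for (int i = 0; i < X; ++i) heap.push({i, 0});
    std::vector<int> fills(X, 0);
    for (int size : categorySizes) {
        while (size > 0) {
            Partition p = heap.top(); heap.pop();
            int take = std::min(size, Y - p.fill);   // what fits in the emptiest partition
            if (take == 0) { heap.push(p); return fills; }   // everything is full
            p.fill += take; size -= take;            // remainder spills to the next one
            fills[p.id] = p.fill;
            heap.push(p);
        }
    }
    return fills;
}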
On reassignment, delete the deleted elements first. Then group new elements by category. Sort the categories by how many preferred locations they have ascending, and then by descending size. Now move the categories with 0 preferred locations to the end.
For each category, if possible split its new elements across the preferred partitions, leaving them equally full. If this is not possible, put them into the emptiest possible partition. If that is not possible, then split them to put them across the fewest possible partitions.
It is, of course, possible to come up with data sets that eventually turn this into a mess. But it makes a pretty good good-faith effort to come out well.
I'm new to data structures. I have been working for the past 72 hours to find an algorithm that inserts a particular value into a singly linked list based on a row and column index. I created the singly linked list based on the sparse matrix below.
I have attached an image of the linked list above. For example, if I wanted to insert the value 8 at row 0 and column 4, what is the most suitable algorithm to make this happen? Thanks in advance, guys.
An interesting point to consider.
First, if you flatten your matrix, you can notice that:
the cell at (0,1) (at row 0 and column 1) becomes the cell at index 1;
the cell at (1,0) becomes the cell at index 5.
More generally, the cell at (i,j) becomes the cell at index i * row_size + j, where row_size is the number of columns per row (5 in this example).
Using this observation, you can walk the list until the computed index of the current element is no longer smaller than the computed index of the cell you want to insert, and insert at that spot.
If you know how to insert a node at a specific position in a linked list (which I recommend you try first if you don't), it should be easy to make the bridge between the two.
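A minimal sketch of that insertion, assuming the list stays sorted by flattened index; the Node layout and the numColumns parameter are my assumptions, since the original node structure is only in the attached image:

struct Node {
    int index;   // flattened position: i * numColumns + j
    int value;
    Node *next;
};

// Insert (or overwrite) value at row i, column j in a list kept sorted by
// flattened index; returns the (possibly new) head.
Node *insert(Node *head, int i, int j, int value, int numColumns) {
    int idx = i * numColumns + j;
    Node **link = &head;
    while (*link && (*link)->index < idx)        // advance while current index is smaller
        link = &(*link)->next;
    if (*link && (*link)->index == idx)
        (*link)->value = value;                  // the cell already exists: overwrite
    else
        *link = new Node{idx, value, *link};     // splice a new node in place
    return head;
}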
I have N objects, and M sets of those objects. The sets are non-empty, distinct, and may intersect. Typically M and N are of the same order of magnitude, usually with M > N.
Historically my sets were encoded as-is: each just contained a table (array) of its objects. But I'd like to create a more optimized encoding. Typically some objects are present in most of the sets, and I want to exploit this.
My idea is to represent the sets as stacks (i.e. singly linked lists) whose bottom parts can be shared across different sets. It can also be described as a tree in which each node/leaf has a pointer to its parent, but not to its children.
Such a data structure allows the most common subsets of objects to be used as roots, from which all the appropriate sets may "inherit".
The most efficient encoding is computed by the following algorithm, which I'll write as recursive pseudo-code.
BuildAllChains()
{
    BuildSubChains(allSets, NULL);
}

BuildSubChains(sets, pParent)
{
    if (sets is empty)
        return;

    trgObj = the most frequent object in sets;

    pNode = new Node;
    pNode->Object = trgObj;
    pNode->pParent = pParent;

    newSets = empty;
    for (each set in sets that contains trgObj)
    {
        remove trgObj from set;
        remove set from sets;
        if (set is empty)
            set->pHead = pNode;     // the chain for this set is complete
        else
            newSets.Insert(set);
    }

    BuildSubChains(sets, pParent);  // sets that did not contain trgObj
    BuildSubChains(newSets, pNode); // sets that contained trgObj
}
Note: the pseudo-code is written in a recursive manner, but technically naive recursion should not be used, because at each point the split is not balanced, and in a degenerate case (which is likely, since the source data isn't random) the recursion depth would be O(N).
Practically I use a combination of loop + recursion, where the recursion is always invoked on the smaller part.
So the idea is to select, each time, the most common object and create a "subset" node which inherits from its parent subset; all the sets that include this object (as well as all the predecessors selected so far) should then be based on this subset.
Now I'm trying to figure out an effective way to select the most frequent object from the sets. Initially my idea was to compute the histogram of all the objects and sort it once. Then, during the recursion, whenever we remove an object and select only the sets that contain (or don't contain) it, deduce the sorted histogram of the remaining sets. But then I realized that this is not trivial, because we remove many sets, each containing many objects.
Of course we can select the most frequent object directly each time, i.e. in O(N*M). But this also looks inferior: in a degenerate case, where an object exists in either almost all or almost none of the sets, we may need to repeat this O(N) times. OTOH, for those specific cases, in-place adjustment of the sorted histogram may be the preferred way to go.
So far I couldn't come up with a good enough solution. Any ideas would be appreciated. Thanks in advance.
Update:
@Ivan: first, thanks a lot for the answer and the detailed analysis.
I do store the list of elements within the histogram, rather than the count only. Actually I use pretty sophisticated data structures (not related to the STL) with intrusive containers, cross-linked pointers, etc. I planned this from the beginning, because back then it seemed to me that adjusting the histogram after removing elements would be trivial.
I think the main point of your suggestion, which I didn't figure out myself, is that at each step the histograms should only contain elements that are still present in the family, i.e. they must not contain zeroes. I thought that in cases where the split is very uneven, creating a new histogram for the smaller part would be too expensive. But restricting it to only the existing elements is a really good idea.
So we remove the sets of the smaller family, adjust the "big" histogram, and build the "small" one. Now I need some clarification about how to keep the big histogram sorted.
One idea, which I thought about first, was an immediate fix of the histogram after every single element removal: for every set we remove, for every object in that set, remove it from the histogram, and if the sort order is broken, swap the histogram element with its neighbour until the order is restored.
This seems good if we remove a small number of objects: we don't need to traverse the whole histogram, we just do a "micro-bubble" sort.
However, when removing a large number of objects, it seems better to just remove them all and then re-sort the array via quicksort.
So, do you have a better idea regarding this?
Update2:
I'm thinking about the following: the histogram should be a data structure which is a binary search tree (auto-balanced, of course), where each element of the tree contains the appropriate object ID and the list of the sets it belongs to (so far). The comparison criterion is the size of this list.
Each set should contain the list of objects it currently contains, where each "object" holds a direct pointer to its histogram element. In addition, each set should contain the number of objects matched so far, set to 0 at the beginning.
Technically we need a cross-linked list node, i.e. a structure that exists in 2 linked lists simultaneously: in the list of a histogram element, and in the list of the set. This node should also contain pointers to both the histogram item and the set. I call it a "cross-link".
Picking the most frequent object is just finding the maximum in the tree.
Adjusting such a histogram is O(M log(N)), where M is the number of elements that are currently affected; this is smaller than N if only a small number are affected.
And I'll also use your idea to build the smaller histogram and adjust the bigger.
Sounds right?
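For concreteness, a structural sketch of that cross-link node (the field names are mine):

struct CrossLink {
    struct HistItem *histItem;            // back-pointer to the histogram element
    struct Set *set;                      // back-pointer to the owning set
    CrossLink *nextInHist, *prevInHist;   // siblings under the same histogram element
    CrossLink *nextInSet, *prevInSet;     // siblings within the same set
};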
I denote the total size of the sets by T. The solution I present works in time O(T log T log N).
For clarity I refer to the initial sets as sets and to the set of these sets as the family.
Indeed, let's store a histogram. In the BuildSubChains function we maintain a histogram of all elements which are present in the sets at the moment, sorted by frequency. It may be something like a std::set of pairs (frequency, value), maybe with cross-references so you can find an element by value. Now taking the most frequent element is straightforward: it is the first element in the histogram. However, maintaining it is trickier.
You split your family of sets into two subfamilies, one containing the most frequent element and one not. Let their total sizes be T' and T''. Take the family with the smaller total size and remove all elements of its sets from the histogram, building the new histogram on the run. Now you have a histogram for both families, and it was built in time O(min(T', T'') log N), where the log N comes from the operations on the std::set.
At first glance it seems to work in quadratic time. However, it is faster. Take a look at any single element: every time we explicitly remove this element from the histogram, the size of its family at least halves, so each element directly participates in no more than log T removals. So there are O(T log T) operations on histograms in total.
There might be a better solution if I knew the total size of the sets. However, no solution can be faster than O(T), and this one is only logarithmically slower.
There may be one more improvement: if you store in the histogram not only elements and frequencies, but also the sets that contain each element (simply another std::set per element), you'll be able to efficiently select all the sets that contain the most frequent element.
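A minimal C++ sketch of such a histogram, using integer object ids; the per-element set of containing sets mentioned above is left out to keep it short:

#include <map>
#include <set>
#include <utility>

struct Histogram {
    std::set<std::pair<int, int>> byFreq;   // (frequency, objectId), ordered
    std::map<int, int> freqOf;              // objectId -> current frequency

    void add(int obj) {                     // one more occurrence of obj
        int &f = freqOf[obj];
        if (f) byFreq.erase({f, obj});
        ++f;
        byFreq.insert({f, obj});
    }
    void remove(int obj) {                  // one occurrence gone; obj must be present
        int f = freqOf[obj];
        byFreq.erase({f, obj});
        if (f == 1) freqOf.erase(obj);      // zero entries are dropped entirely
        else { freqOf[obj] = f - 1; byFreq.insert({f - 1, obj}); }
    }
    int mostFrequent() const {              // histogram must be non-empty
        return byFreq.rbegin()->second;
    }
};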
The problem is to come up with a data structure that can work with a giant Excel sheet (which obviously does not fit into main memory as it is).
Imagine the following as part of an Excel sheet, where e represents an empty cell.
    A  B  C  D  ...
1   3  9  e  e  ...
2   e  e  e  e  ...
3   e  e  5  e  ...
4   e  e  e  e  ...
5   e  e  6  e  ...
So the data structure should allow me to store the Excel sheet in memory (we know that only the values in the sheet fit into main memory) and support the following operations:
getByColumn(Column col); - gives all values of a certain column, say 5, 6 for column C
getByRow(Row row); - gives all values of a certain row, say 3, 9 and so on for row 1
insertCell(Column col, Row row, int value); - inserts or overrides the value of a cell
getExcelSheet(FileName); - gives the whole excel sheet in a compressed form (data structure)
What is a thinkable data structure for this? I am preparing for an interview and this is not homework. I would like to gain some insights from different folks.
Just to give a sense of scale: say the Excel sheet is 1 terabyte and we have 8 GB of memory. A 1-terabyte Excel sheet simply has many, many empty cells, with the values spread all over different cells.
Use a Map/Dictionary mapping cell coordinates to values, returning a default value of EMPTY_CELL for everything not explicitly set.
Implement the desired methods based on that.
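A quick sketch of that approach; EMPTY_CELL as a sentinel, the method names, and int values are assumptions for illustration:

#include <map>
#include <utility>
#include <vector>

constexpr int EMPTY_CELL = 0;   // assumed sentinel for empty cells

struct Sheet {
    std::map<std::pair<int, int>, int> cells;   // (row, col) -> value

    void insertCell(int col, int row, int value) { cells[{row, col}] = value; }

    int get(int col, int row) const {
        auto it = cells.find({row, col});
        return it == cells.end() ? EMPTY_CELL : it->second;
    }

    // O(K) scan over all K nonempty cells: this is exactly why getByRow and
    // getByColumn are slow with a plain dictionary of keys.
    std::vector<int> getByColumn(int col) const {
        std::vector<int> out;
        for (const auto &kv : cells)
            if (kv.first.second == col) out.push_back(kv.second);
        return out;
    }
};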
There is an extensive literature on the topic of sparse matrices, which is a widely-used term for what you call a giant Excel sheet. The literature covers both data structures and suitable algorithms for creating and modifying them; the Wikipedia article provides a good starting point for your research. It may tell you enough to prepare yourself for your interview.
An elaboration of Tass' comment and Mark's answer (for which +1):
You can insert cell values efficiently if you use what Wikipedia calls Dictionary Of Keys or DOK (which is essentially Jens' answer), but as you rightly comment, getByRow and getByColumn will be fairly slow.
A better option would be what Wikipedia calls Coordinate List or COO: just a set of triples (rowindex, columnindex, value). You'd probably actually store this as three arrays. To make insertion fast, keep a sorted set and an unsorted set of entries, and insert into the unsorted set; whenever the number of unsorted entries exceeds a threshold T (which might depend on the total number K of nonempty cells), sort them into the sorted set.
You'll want to sort them all by, say, row index, and keep another array of indices into the arrays to give the version sorted by column index.
For getByRow you would take the correct section of the arrays sorted by row index, and additionally search through the unsorted set.
All of this assumes that you do have enough memory to store a couple of words for every nonempty entry in the matrix. If not, you'll need to combine this with some sort of external memory approach.
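A rough sketch of the COO scheme with an unsorted insert buffer; the threshold value, the full re-sort in merge(), and the absence of overwrite handling are my simplifications (a production version would merge incrementally and also maintain the column-order index):

#include <algorithm>
#include <cstddef>
#include <vector>

struct SparseCOO {
    std::vector<int> rows, cols, vals;        // parallel arrays, sorted by (row, col)
    std::vector<int> rowBuf, colBuf, valBuf;  // unsorted recent inserts
    std::size_t threshold = 1024;             // T: when to fold the buffer in

    void insertCell(int col, int row, int value) {
        rowBuf.push_back(row); colBuf.push_back(col); valBuf.push_back(value);
        if (rowBuf.size() > threshold) merge();
    }

    std::vector<int> getByRow(int row) const {
        std::vector<int> out;
        // binary search the sorted section...
        auto lo = std::lower_bound(rows.begin(), rows.end(), row);
        auto hi = std::upper_bound(rows.begin(), rows.end(), row);
        for (auto it = lo; it != hi; ++it) out.push_back(vals[it - rows.begin()]);
        // ...plus a linear scan of the small unsorted buffer
        for (std::size_t i = 0; i < rowBuf.size(); ++i)
            if (rowBuf[i] == row) out.push_back(valBuf[i]);
        return out;
    }

    void merge() {   // fold the buffer in and re-sort by (row, col)
        rows.insert(rows.end(), rowBuf.begin(), rowBuf.end());
        cols.insert(cols.end(), colBuf.begin(), colBuf.end());
        vals.insert(vals.end(), valBuf.begin(), valBuf.end());
        rowBuf.clear(); colBuf.clear(); valBuf.clear();
        std::vector<std::size_t> idx(rows.size());
        for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
        std::sort(idx.begin(), idx.end(), [&](std::size_t a, std::size_t b) {
            return rows[a] != rows[b] ? rows[a] < rows[b] : cols[a] < cols[b];
        });
        std::vector<int> r, c, v;
        for (std::size_t i : idx) {
            r.push_back(rows[i]); c.push_back(cols[i]); v.push_back(vals[i]);
        }
        rows.swap(r); cols.swap(c); vals.swap(v);
    }
};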
You could store this magic Excel sheet in a two-dimensional array, with empty cells containing null. If the data won't fit in that either, I think we're out of luck.
Intro
Consider you have a list of key/value pairs:
(0,a) (1,b) (2,c)
You have a function that inserts a new value between two existing pairs, and you need to give it a key that keeps the order:
(0,a) (0.5,z) (1,b) (2,c)
Here the new key was chosen as the average of the keys of the two bounding pairs.
The problem is that your list may see millions of inserts. If these inserts all land close to each other, you may end up with keys like 2^(-1000000), which are not easily storable in any standard or special number class.
The problem
How can you design a system for generating keys that:
Gives the correct result (larger/smaller than) when compared to all the rest of the keys.
Takes up only O(log n) memory (where n is the number of items in the list).
My attempts
First I tried different number classes, like fractions and even polynomials, but I could always find examples where the key size would grow linearly with the number of inserts.
Then I thought about saving pointers to a number of other keys and saving the lower/greater-than relationships, but that would always require at least O(sqrt(n)) memory and time for comparison.
Extra info: Ideally the algorithm shouldn't break when pairs are deleted from the list.
I agree with snowlord. A tree would be ideal in this case, and a red-black tree would prevent things from getting unbalanced. If you really need keys, though, I'm pretty sure you can't do better than using the average of the keys on either side of the value you need to insert. That will increase your key length by one bit each time. What I recommend is renormalizing the keys periodically: every x inserts, or whenever you detect keys being generated too close together, renumber everything from 1 to n.
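A tiny sketch of that "average the neighbours, renumber when crowded" strategy, using widely spaced 64-bit integer keys; the GAP constant and the renumber trigger are my own choices:

#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int64_t GAP = INT64_C(1) << 32;   // initial spacing between keys

struct Item { int64_t key; char value; };

// Insert after position pos (pos < list.size()), taking the midpoint of the
// neighbouring keys; when two neighbours touch, renumber the whole list.
void insertAfter(std::vector<Item> &list, std::size_t pos, char value) {
    int64_t lo = list[pos].key;
    int64_t hi = (pos + 1 < list.size()) ? list[pos + 1].key : lo + 2 * GAP;
    if (hi - lo < 2) {                      // no room left: renormalize all keys
        for (std::size_t i = 0; i < list.size(); ++i)
            list[i].key = (int64_t)(i + 1) * GAP;
        lo = list[pos].key;
        hi = (pos + 1 < list.size()) ? list[pos + 1].key : lo + 2 * GAP;
    }
    list.insert(list.begin() + pos + 1, Item{lo + (hi - lo) / 2, value});
}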
Edit:
You don't need to compare keys if you're inserting by position instead of by key. The compare function for the red-black tree would just use the order in the conceptual list, which lines up with the in-order traversal of the tree. If you're inserting at position 4 in the list, insert a node at position 4 in the tree (using in-order numbering). If you're inserting after a certain node (such as "a"), it's the same. You might have to use your own implementation if whatever language/library you're using requires a key.
I don't think you can avoid getting O(n)-sized keys without reassigning keys during operation.
As a practical solution I would build an inverted search tree, with pointers from the children to the parents, where each pointer is marked according to whether it comes from a left or a right child. To compare two elements you find the closest common ancestor, i.e. the point where the paths to the two elements diverge.
Reassigning keys is then a rebalancing of the tree; you can do that with rotations, which don't change the order.
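A rough sketch of that comparison, assuming each node stores a parent pointer plus a left/right flag; climbing to the root costs O(depth), which the rebalancing keeps at O(log n):

#include <vector>

struct TNode {
    TNode *parent = nullptr;
    bool isRight = false;   // false: left child of parent, true: right child
};

// Collect the path root -> node as a sequence of left/right turns.
static std::vector<bool> pathFromRoot(const TNode *n) {
    std::vector<bool> turns;
    for (; n->parent; n = n->parent) turns.push_back(n->isRight);
    return std::vector<bool>(turns.rbegin(), turns.rend());
}

// True if a precedes b in the in-order (i.e. list) ordering.
bool precedes(const TNode *a, const TNode *b) {
    std::vector<bool> pa = pathFromRoot(a), pb = pathFromRoot(b);
    std::size_t i = 0;
    while (i < pa.size() && i < pb.size() && pa[i] == pb[i]) ++i;
    if (i < pa.size() && i < pb.size()) return !pa[i];   // paths diverge: left precedes right
    if (i < pa.size()) return !pa[i];   // b is the ancestor: a precedes iff a is in b's left subtree
    if (i < pb.size()) return pb[i];    // a is the ancestor: a precedes iff b is in a's right subtree
    return false;                       // same node
}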