Using a Linked List to Represent a Matrix Class

I'm having trouble initializing the linked list for the matrix based on the parameters I input. So if I input the parameters (3, 3) it should actually make a 4x4 structure, so I can use the first row and first column for indexing, with the top-left corner node as an entry point.
def __init__(self, m, n, default=0):
    self._head = MatrixNode(None)
    for node in range(m - 1):
        node = MatrixNode(0)
        node._right = node
    for node in range(n - 1):
        node = MatrixNode(0)
        node._down = node
This is what I have so far, but I'm sure it's horrible.

First, it would be useful to know what a MatrixNode is. I guess you just want to store a value there?
Then I see two linear loops, while a matrix is an n*m data structure. Are you sure your loops don't need to be nested to initialize your structure correctly?
For linked lists I would expect something like row.next = nextrow and row.startnode.next = nextnode; I don't see anything like that here.
Having said that, I want to ask whether you really want to implement a matrix yourself, and in such an object-oriented (inefficient!) way.
You can use two-dimensional lists (a = [[1, 2], [3, 4]]; a[0][0] == 1) or a good implementation from a numerics library like numpy/scipy.
There you have numpy.array for storing n-dimensional data (with nice addressing like matrix[1, 2] and syntax similar to MATLAB) or numpy.matrix, which is like an array with some methods overloaded for matrix operations (i.e. matrix-matrix multiplication of arrays is pointwise, while for matrices it's the usual matrix multiplication).
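A quick sketch of both alternatives (this is the standard numpy API; I use the @ operator for the matrix product rather than numpy.matrix):

import numpy as np

a = [[1, 2], [3, 4]]   # plain two-dimensional list
assert a[0][0] == 1

A = np.array(a)        # numpy n-dimensional array
assert A[0, 1] == 2    # matrix[i, j]-style addressing
print(A * A)           # pointwise product
print(A @ A)           # usual matrix multiplication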

You are right, it's horrible :)
First things first, a linked list is a very bad way of representing a matrix.
If you want to represent a matrix, start with a list of lists, and work from there if that's not enough (see the other answer mentioning numpy, for example).
If you want to learn to use linked lists, choose a better example.
Then: you are re-using the variable name "node" for different things:
It's your loop index. The code for node in range(...) will assign an integer from the range to node in every iteration.
Then you assign a new MatrixNode to node, and then you set the node's neighbor (_right or _down) not to the actual neighbor, but to itself (node._right = node).
You also never save the nodes you create inside the loops anywhere, so they will be garbage-collected.
And you never use the optional argument default.
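For reference, here is a minimal sketch of a corrected __init__ that addresses the points above. The MatrixNode definition is my assumption (the question doesn't show it); the sketch builds the extra header row and column, uses the default argument, links real neighbors, and keeps every node reachable from the head:

class MatrixNode:
    def __init__(self, value=None):
        self._value = value
        self._right = None
        self._down = None

class Matrix:
    def __init__(self, m, n, default=0):
        # Build an (m+1) x (n+1) grid of nodes: row 0 and column 0 are
        # the index headers, and grid[0][0] is the entry point.
        grid = [[MatrixNode(None if i == 0 or j == 0 else default)
                 for j in range(n + 1)] for i in range(m + 1)]
        # Wire up the _right and _down links; the temporary grid list is
        # discarded, but the nodes stay reachable from self._head.
        for i in range(m + 1):
            for j in range(n + 1):
                if j < n:
                    grid[i][j]._right = grid[i][j + 1]
                if i < m:
                    grid[i][j]._down = grid[i + 1][j]
        self._head = grid[0][0]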

Algorithm for selecting the most frequent object during factorization

I have N objects, and M sets of those objects. Sets are non-empty, different, and may intersect. Typically M and N are of the same order of magnitude, usually M > N.
Historically my sets were encoded as-is, each just containing a table (array) of its objects, but I'd like to create a more optimized encoding. Typically some objects are present in most of the sets, and I want to exploit this.
My idea is to represent the sets as stacks (i.e. singly linked lists), where their bottom parts can be shared across different sets. It can also be described as a tree, where each node/leaf has a pointer to its parent, but not to its children.
Such a data structure allows using the most common subsets of objects as roots, which all the appropriate sets may "inherit".
The most efficient encoding is computed by the following algorithm. I'll write it as a recursive pseudo-code.
BuildAllChains()
{
    BuildSubChains(allSets, NULL);
}

BuildSubChains(sets, pParent)
{
    if (sets is empty)
        return;

    trgObj = the most frequent object from sets;

    pNode = new Node;
    pNode->Object = trgObj;
    pNode->pParent = pParent;

    newSets = empty;
    for (each set in sets that contains the trgObj)
    {
        remove trgObj from set;
        remove set from sets;
        if (set is empty)
            set->pHead = pNode;
        else
            newSets.Insert(set);
    }

    BuildSubChains(sets, pParent);
    BuildSubChains(newSets, pNode);
}
Note: the pseudo-code is written recursively, but technically naive recursion should not be used, because at each point the split is not balanced, and in a degenerate case (which is likely, since the source data isn't random) the recursion depth would be O(N).
In practice I use a combination of a loop + recursion, where the recursion is always invoked on the smaller part.
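Below is a rough Python rendering of the pseudo-code, just to make the flow concrete. The names, the dict of set ids, and the naive most-frequent selection are my own; that selection step is exactly the O(N*M) part discussed further down:

from collections import Counter

class Node:
    def __init__(self, obj, parent):
        self.obj = obj
        self.parent = parent

def build_chains(all_sets):
    # all_sets: dict mapping a set id to a set of objects.
    # Returns, per set id, the head node of its chain; following the
    # parent pointers from a head enumerates all objects of that set.
    heads = {}

    def recurse(sets, parent):
        while sets:    # loop over one branch, recurse into the other
            freq = Counter(o for _, s in sets for o in s)
            target = freq.most_common(1)[0][0]   # naive O(N*M) selection
            node = Node(target, parent)
            rest, containing = [], []
            for sid, s in sets:
                if target in s:
                    s.discard(target)
                    if s:
                        containing.append((sid, s))
                    else:
                        heads[sid] = node        # chain complete for this set
                else:
                    rest.append((sid, s))
            recurse(containing, node)
            sets = rest                          # same parent, strictly fewer sets

    recurse([(sid, set(s)) for sid, s in all_sets.items()], None)
    return heads

# e.g. build_chains({"A": {1, 2, 3}, "B": {1, 2}, "C": {4}}) gives A and B
# chains that share the bottom nodes for the common objects 1 and 2.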
So the idea is to select, each time, the most common object, create a "subset" which inherits from its parent subset, and base all the sets that include it (as well as all the predecessors selected so far) on this subset.
Now I'm trying to figure out an efficient way to select the most frequent object from the sets. Initially my idea was to compute the histogram of all the objects and sort it once. Then, during the recursion, whenever we remove an object and keep only the sets that contain/don't contain it, deduce the sorted histogram of the remaining sets. But then I realized that this is not trivial, because we remove many sets, each containing many objects.
Of course we can select the most frequent object directly each time, i.e. in O(N*M). But that also looks inferior: in a degenerate case, where an object exists in either almost all or almost no sets, we may need to repeat this O(N) times. OTOH for those specific cases, in-place adjustment of the sorted histogram may be the preferred way to go.
So far I couldn't come up with a good enough solution. Any ideas would be appreciated. Thanks in advance.
Update:
@Ivan: first, thanks a lot for the answer and the detailed analysis.
I do store the list of elements within the histogram, rather than the count only. Actually I use pretty sophisticated data structures (not related to the STL) with intrusive containers, cross-linked pointers, etc. I planned this from the beginning, because back then it seemed to me that adjusting the histogram after removing elements would be trivial.
I think the main point of your suggestion, which I didn't figure out myself, is that at each step the histograms should only contain elements that are still present in the family, i.e. they must not contain zeroes. I thought that in cases where the split is very uneven, creating a new histogram for the smaller part would be too expensive. But restricting it to only existing elements is a really good idea.
So we remove the sets of the smaller family, adjust the "big" histogram, and build the "small" one. Now I need some clarification about how to keep the big histogram sorted.
One idea, which I thought of first, was an immediate fix of the histogram after every single element removal. I.e. for every set we remove, for every object in the set, remove it from the histogram, and if the sort is broken, swap the histogram element with its neighbor until the sort is restored.
This seems good if we remove a small number of objects: we don't need to traverse the whole histogram, we do a "micro-bubble" sort.
However, when removing a large number of objects, it seems better to just remove all the objects and then re-sort the array via quicksort.
So, do you have a better idea regarding this?
Update2:
I'm thinking about the following: the histogram should be a binary search tree (auto-balanced, of course), where each element of the tree contains the appropriate object ID and the list of the sets it currently belongs to. The comparison criterion is the size of this list.
Each set should contain the list of objects it currently holds, where each "object" has a direct pointer to its histogram element. In addition, each set should hold the number of objects matched so far, set to 0 at the beginning.
Technically we need a cross-linked list node, i.e. a structure that exists in 2 linked lists simultaneously: in the list of a histogram element, and in the list of the set. This node should also contain pointers to both the histogram item and the set. I call it a "cross-link".
Picking the most frequent object is just finding the maximum in the tree.
Adjusting such a histogram is O(M log(N)), where M is the number of elements currently affected, which is smaller than N if only a small number is affected.
And I'll also use your idea to build the smaller histogram and adjust the bigger one.
Sounds right?
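For illustration, the cross-link record might look like this (a minimal sketch; the field names are mine): one node that lives in two intrusive doubly linked lists at once, with back-pointers to both owners so it can be unlinked from either side in O(1).

class CrossLink:
    def __init__(self, hist_item, owner_set):
        self.hist_item = hist_item   # the histogram element (object ID + set list)
        self.owner_set = owner_set   # the set this membership record belongs to
        self.prev_in_hist = self.next_in_hist = None   # links in the histogram element's list
        self.prev_in_set = self.next_in_set = None     # links in the set's object list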
I denote the total size of the sets by T. The solution I present works in time O(T log T log N).
For clarity, I refer to the initial sets as sets and to the set of these sets as the family.
Indeed, let's store a histogram. In the BuildSubChains function we maintain a histogram of all elements present in the sets at the moment, sorted by frequency. It may be something like an std::set of pairs (frequency, value), maybe with cross-references so you can find an element by value. Now taking the most frequent element is straightforward: it is the first element in the histogram. However, maintaining it is trickier.
You split your family of sets into two subfamilies, one containing the most frequent element, one not. Let their total sizes be T' and T''. Take the family with the smaller total size and remove all elements of its sets from the histogram, building the new histogram on the run. Now you have a histogram for both families, and it is built in time O(min(T', T'') log N), where the log N comes from operations on the std::set.
At first glance it seems that this works in quadratic time. However, it is faster. Take a look at any single element. Every time we explicitly remove this element from the histogram, the size of its family at least halves, so each element directly participates in no more than log T removals. So there will be O(T log T) histogram operations in total.
There might be a better solution if I knew the total size of the sets. However, no solution can be faster than O(T), and this is only logarithmically slower.
There may be one more improvement: if you store in the histogram not only elements and frequencies, but also the sets that contain each element (simply another std::set per element), you'll be able to efficiently select all sets that contain the most frequent element.
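For what it's worth, here is one way the histogram could look in Python, with a lazy-deletion max-heap substituted for the std::set of (frequency, value) pairs. The substitution is my own; it keeps the same logarithmic bounds and assumes objects are hashable and orderable:

import heapq

class Histogram:
    def __init__(self, sets):
        self.count = {}
        for s in sets:
            for obj in s:
                self.count[obj] = self.count.get(obj, 0) + 1
        self.heap = [(-c, obj) for obj, c in self.count.items()]
        heapq.heapify(self.heap)

    def remove(self, obj):
        # One occurrence of obj disappears; push a fresh entry and let
        # the stale one linger until most_frequent() skips it.
        c = self.count[obj] - 1
        if c:
            self.count[obj] = c
            heapq.heappush(self.heap, (-c, obj))
        else:
            del self.count[obj]

    def most_frequent(self):
        while self.heap:
            neg_c, obj = self.heap[0]
            if self.count.get(obj) == -neg_c:
                return obj
            heapq.heappop(self.heap)   # stale entry, discard
        return None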

Adjacency Set Representation in Python

So I got this really cool book from the university library today, Python Algorithms by Magnus Lie Hetland, and in the second chapter of the book he creates an adjacency list as follows, which was kind of cool:
a,b,c,d,e,f,g,h = range(8)
N = [{b,c,d,e,f},{c,e},{d},{e},{f},{c,g,h},{f,h},{f,g}]
And when I do:
N[a], I get the first element of N, and it kind of surprises me: how did it get mapped in such a manner?
I found this question, but it's different from what I am asking; still, let me know if it's a duplicate.
Adjacency List and Adjacency Matrix in Python
Thanks,
Prerit
It's just Python.
a,b,c,d,e,f,g,h = range(8)
is tuple assignment. It assigns 0 to a, 1 to b, etc.
N = [{b,c,d,e,f},{c,e},{d},{e},{f},{c,g,h},{f,h},{f,g}]
creates a list named N whose 0th element is the set {b,c,d,e,f}, etc.
So when you say N[a], you're also saying N[0], and that's the set you're seeing.
It's a cool trick for building a constant graph by hard-coding in Python, but if you need to build the graph dynamically based on input or output from another algorithm, then you'll want a different representation.
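For example, a common dynamic representation is a dict mapping each node to the set of its neighbours (a minimal sketch; the names are my own):

from collections import defaultdict

def build_graph(edges):
    N = defaultdict(set)
    for u, v in edges:
        N[u].add(v)
    return N

G = build_graph([(0, 1), (0, 2), (1, 2)])
print(1 in G[0])   # True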

Finding the width of a directed acyclic graph... with only the ability to find parents

I'm trying to find the width of a directed acyclic graph... as represented by an arbitrarily ordered list of nodes, without even an adjacency list.
The graph/list is for a parallel GNU Make-like workflow manager that uses files as its criteria for execution order. Each node has a list of source files and target files. We have a hash table in place so that, given a file name, the node which produces it can be determined. In this way, we can figure out a node's parents by examining the nodes which generate each of its source files using this table.
That is the ONLY ability I have at this point, without changing the code severely. The code has been in public use for a while, and the last thing we want to do is to change the structure significantly and have a bad release. And no, we don't have time to test rigorously (I am in an academic environment). Ideally we're hoping we can do this without doing anything more dangerous than adding fields to the node.
I'll be posting a community-wiki answer outlining my current approach and its flaws. If anyone wants to edit that, or use it as a starting point, feel free. If there's anything I can do to clarify things, I can answer questions or post code if needed.
Thanks!
EDIT: For anyone who cares, this will be in C. Yes, I know my pseudocode is in some horribly botched Python look-alike. I'm sort of hoping the language doesn't really matter.
I think the "width" you're considering here isn't really what you want - the width depends on how you assign levels to each node where you have some choice. You noticed this when you were deciding whether to assign all sources to level 0 or all sinks to the max level.
Instead, you just want to count the number of nodes and divide by the "critical path length", which is the longest path in the dag. This gives the average parallelism for the graph. It depends only on the graph itself, and it still gives you an indication of how wide the graph is.
To compute the critical path length, just do what you're doing - the critical path length is the maximum level you end up assigning.
In my opinion, when you're doing this kind of last-minute development, it's best to keep the new structures separate from the ones you are already using. At this point, if I were pressed for time, I would go for a simpler solution:
Create an adjacency matrix for the graph using the parent data (should be easy).
Perform a topological sort using this matrix (or even use tsort if pressed for time).
Now that you have a topological sort, create an array level, with one element for each node.
For each node:
If the node has no parents, set its level to 0.
Otherwise set its level to one more than the maximum of its parents' levels.
Find the maximum level width.
The question is, as Keith Randall asked: is this the right measurement for what you need?
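If it helps, here is a compact sketch of these steps in Python (my own names throughout; a dict of child lists stands in for the adjacency matrix, and parents(node) is a placeholder for the existing file-based hash-table lookup). Levels follow the longest path from any source, and the last line also computes Keith Randall's average-parallelism measure:

from collections import Counter, defaultdict, deque

def level_width(nodes, parents):
    # nodes: list of node objects; parents(n): iterable of n's parents
    children = defaultdict(list)
    remaining = {n: 0 for n in nodes}       # unprocessed-parent counts
    for n in nodes:
        for p in parents(n):
            children[p].append(n)
            remaining[n] += 1
    queue = deque(n for n in nodes if remaining[n] == 0)   # the sources
    level = {n: 0 for n in nodes}
    while queue:                            # Kahn-style topological sweep
        n = queue.popleft()
        for c in children[n]:
            level[c] = max(level[c], level[n] + 1)
            remaining[c] -= 1
            if remaining[c] == 0:
                queue.append(c)
    critical_path = max(level.values()) + 1
    width = max(Counter(level.values()).values())
    return width, len(nodes) / critical_path   # width, average parallelism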
Here's what I (Platinum Azure, the original author) have so far.
Preparations/augmentations:
Add "children" field to linked list ("DAG") node
Add "level" field to "DAG" node
Add "children_left" field to "DAG" node. This is used to make sure that all children are examined before a parent is examined (in a later stage of the algorithm).
Algorithm:
Find the number of immediate children for all nodes; also, determine leaves by adding nodes with children==0 to list.
for l in L:
    l.children = 0

for l in L:
    l.level = 0
    for p in l.parents:
        p.children += 1

Leaves = []
for l in L:
    l.children_left = l.children
    if l.children == 0:
        Leaves.append(l)
Assign every node a "reverse depth" level. Normally, by depth I mean: topologically sort and assign depth = 0 to nodes with no parents. However, I'm thinking I need to reverse this, with depth = 0 corresponding to leaves. Also, we want to make sure that no node is added to the queue until all of its children have "looked at it" first (to determine its proper "depth level").
max_level = 0
while Leaves:
    l = Leaves.pop()
    for p in l.parents:
        p.children_left -= 1
        p.level = max(p.level, l.level + 1)
        if p.children_left == 0:
            # we only want to append parents whose level is known for sure
            Leaves.append(p)
            if p.level > max_level:
                max_level = p.level
Now that every node has a level, simply create an array and then go through the list once more to count the number of nodes in each level.
level_count = [0] * (max_level + 1)
for l in L:
    level_count[l.level] += 1
width = max(level_count)
So that's what I'm thinking so far. Is there a way to improve on it? It's linear time all the way, but it's got five or six linear scans, and there will probably be a lot of cache misses and the like. I have to wonder if there isn't a way to exploit some locality with a better data structure -- without actually changing the underlying code beyond node augmentation.
Any thoughts?

A data structure based on the R-Tree: creating new child nodes when a node is full, but what if I have a lot of objects at the exact same position?

I realize my title is not very clear, but I am having trouble thinking of a better one. If anyone wants to correct it, please do.
I'm developing a data structure for my 2-dimensional game with an infinite universe. The data structure is based on a simple (!) node/leaf system, like the R-Tree.
This is the basic concept: you set how many children you want a node (a container) to have at most. If you want to add a leaf, but the node the leaf should go into is full, the structure creates a new set of nodes within this node and moves all current leaves to their new (more exact) node. This way, very populated areas get many more subdivisions than a very big but rarely visited area.
This works for normal objects. The only problem arises when I have more than maxChildsPerNode objects with the exact same X,Y location: because the node is full, it creates more exact subnodes, but the old leaves are all put into the exact same subnode again because they have the exact same position -- resulting in an infinite loop of creating more and more nodes.
So, what should I do when I want to add more leafs than maxChildsPerNode with the exact same position to my tree?
PS. if I failed to explain my problem, please tell me, so I can try to improve the explanation.
Update: this is how I check whether all leaves in a full node have identical positions:
//returns whether all leafs in the given leaf list have identical positions
private function allLeafsHaveSamePos(leafArr:Array<RTLeaf<T>>):Bool {
    if (leafArr.length > 1) {
        // compare every leaf's position against the first leaf's
        var lastLeafTopX:Float = leafArr[0].topX;
        var lastLeafTopY:Float = leafArr[0].topY;
        for (i in 1...leafArr.length) {
            if ((leafArr[i].topX != lastLeafTopX) || (leafArr[i].topY != lastLeafTopY)) return false;
        }
    }
    return true;
}
I would like to ask a question... is it really that important that the maxChildsPerNode constraint be respected?
I would rather think of this maximum as a hint to the structure for when to split, and simply ignore it when there is no meaningful way to perform the split.
Of course you'd better rethink the name then, otherwise it'd be odd for the next maintainer.
In pseudo code I would use something like this:
def AddToNode(node, item):
    node.items.append(item)
    if len(node.items) > node.splitHint:
        leftNode = Node(node.splitHint)
        rightNode = Node(node.splitHint)
        node.split(leftNode, rightNode)
        if len(leftNode.items) == 0 or len(rightNode.items) == 0:
            node.splitHint *= 1.5 # famous golden ratio ;)
        else:
            node.items = [leftNode, rightNode]
The important step is to modify the hint when we detect that we cannot abide by it, in order not to perform this check at each insertion (this way we obtain a constant amortized cost).
It looks like a bit of a mismatch between your data and your structure, since you have a structure that assumes N objects within an arbitrarily large area when you're supplying it with >N objects on an infinitely small point. It might be worth using a different structure for this data?
Hack fix: apply a tiny random displacement to your newly created objects. This should allow the space to be subdivided by the existing algorithm.
Better fix: ensure that your algorithm for splitting a leaf node always generates at least 2 new leaf nodes to begin with. When reassigning objects to the new leaf nodes, or when performing normal insertions, iterate over all the candidates, and if more than one is equally suitable then you can tie-break based on how full they are. This should result in your co-located players ending up split evenly across the otherwise identical nodes.
From common sense I'd not assume having two objects in the same position ever, but if this is a part of the idea, then I would introduce one more axis, say 'spin', an integer number, and impose a restriction that all your objects are fermions (cannot have the same location and spin at the same time).
If you have a set of objects on the exact same spot, any query for a region that contains one of them should return all of them, so there's no reason to split them: it doesn't gain you anything. Simply either count the number of distinct locations when deciding whether to split, or have each element in the leaf node be an object that encapsulates (coordinates, [list of objects at those coordinates]).
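A minimal sketch of that encapsulation idea (the names are my own): each leaf entry groups every object stored at one exact position, so co-located objects never create split pressure.

class LeafEntry:
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.objects = []          # all objects at exactly (x, y)

def add_to_leaf(entries, x, y, obj):
    for e in entries:
        if e.x == x and e.y == y:  # same spot: extend the bucket
            e.objects.append(obj)
            return
    entry = LeafEntry(x, y)        # new distinct location
    entry.objects.append(obj)
    entries.append(entry)
# splitting decisions then count len(entries), i.e. distinct locations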

Remembering the "original" index of elements after sorting

Say I employ merge sort to sort an array of integers. Now I also need to remember the positions the elements had in the unsorted array, initially. What would be the best way to do this?
A very naive and space-consuming way (in C) would be to maintain each number as a "structure" with another member storing its original index:
struct integer {
    int value;
    int orig_pos;
};
But obviously there are better ways. Please share your thoughts and solutions if you have already tackled such problems. Let me know if you need more context. Thank you.
Clearly for an N-long array you do need to store SOMEwhere N integers -- the original position of each item, for example; any other way to encode "1 out of N!" possibilities (i.e., what permutation has in fact occurred) will also take at least O(N) space (since, by Stirling's approximation, log(N!) is about N log(N)...).
So, I don't see why you consider it "space consuming" to store those indices most simply and directly. Of course there are other possibilities (taking similar space): for example, you might make a separate auxiliary array of the N indices and sort THAT auxiliary array (based on the value at that index) leaving the original one alone. This means an extra level of indirectness for accessing the data in sorted order, but can save you a lot of data movement if you're sorting an array of large structures, so there's a performance tradeoff... but the space consumption is basically the same!-)
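In Python, for instance, the auxiliary index array is a one-liner; a small sketch for concreteness:

data = [30, 10, 20]
order = sorted(range(len(data)), key=data.__getitem__)
print(order)                     # [1, 2, 0]: order[k] is the original
                                 # position of the k-th smallest value
print([data[i] for i in order])  # [10, 20, 30]; data itself is untouched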
Is the struct such a bad idea? The alternative, to me, would be an array of pointers.
It feels to me that in this question you have to consider the age-old trade-off: speed vs. size. In either case, you are keeping both a new representation of your data (the sorted array) and an old representation (the way the array used to look), so inherently your solution will have some data replication. If you are sorting n numbers and you need to remember where those n numbers were before the sort, you will have to store n pieces of information somewhere; there is no getting around that.
As long as you accept that you are doubling the amount of space needed to keep this old data, you should consider the specific application and decide what will be faster. One option is to just make a copy of the array before you sort it, but resolving which element was where later might turn into an O(N) problem. From that point of view, your suggestion of adding another int to your struct doesn't seem like such a bad idea, if it fits the way you will be using the data later.
This looks like a case where I would use an index sort. The following C# example shows how to do it with a lambda expression. I am new to using lambdas, but they can do some complex tasks very easily.
// first, some data to work with
List<double> anylist = new List<double>();
anylist.Add(2.18); // add a value
... // add many more values
// index sort
IEnumerable<int> serial = Enumerable.Range(0, anylist.Count);
int[] index = serial.OrderBy(item => (anylist[item])).ToArray();
// how to use
double FirstValue = anylist[index[0]];
double SecondValue = anylist[index[1]];
And, of course, anylist is still in the original order.
You can do it the way you proposed.
You can also retain a copy of the original unsorted array (which means you may use a not-in-place sorting algorithm).
You can create an additional array containing only the original indices.
All three ways are equally space-consuming; there is no "better" way. You may use short instead of int to save space if your array won't exceed 65k elements (but be aware of structure padding with your suggestion).
