I've recently implemented a piece of work that relied on a Map-of-Maps structure, simply because I needed two keys pointing to a single value. To my understanding, a dictionary should conceptually map a single key to each value.
What would a proper data structure be for a pair of keys pointing to a single value? What kinds of collections would you use in Java or C++ to achieve this, for example?
Let us take a practical example: values of a function at certain integer points on a two-dimensional plane. So the function is f : (int x, int y) -> double value.
The straightforward way is to pack all arguments into a struct.
In this example, it means the function maps points to values, and a point is a pair of two ints, or a custom struct Point with two int fields or getters if needed.
So the data structure is map<Point, double> f, where Point can be pair<int, int> in the simple case.
We can also say that the function is a mapping from int x to a family of one-argument functions, which map int y to a double value.
The data structure will then look as follows: map<int, map<int, double>> f.
The choice depends on the function you are modeling, and maybe on performance considerations.
I'm filling a stack/vector (a dynamically sized container with fast random access by index with insertion only at the end) with composite data (a struct, class, tuple…). For a specific attribute with a small set of possible values, I will want to access the nth of all elements in the stack where this attribute satisfies a condition. To achieve this, additional information can be stored along each composite or in a separate data structure.
Note that the vector is large and that the compared attribute has a small value range but is compared against a set of allowed values. Also, the attributes aren't distributed evenly throughout the composites in the vector.
Here is pseudocode of an O(n) naïve approach. How can I improve on it?
enum Fruit { apple, orange, banana, potato };

struct c {
    Fruit a;
    Data d;
};

// Let's assume v has a length of many thousand and that the distribution of
// fruits is *not* completely random, e.g. maybe potato only rarely occurs or
// bananas tend to come in packs
c getFruit(const vector<c>& v, const set<Fruit>& s, int n) {
    int counter = 0;
    // iterate over all of v's indices
    for (size_t i = 0; i < v.size(); ++i) {
        if (s.count(v[i].a)) {  // attribute is in the allowed set
            if (n == counter) {
                return v[i];
            }
            counter += 1;
        }
    }
    throw out_of_range("fewer than n+1 matching elements");
}
// note: The attribute is compared to a set (arbitrary combination of fruits)!
getFruit(largeVector, set{apple, orange, potato}, 15234)
Another approach would be to create a vector for each possible set of fruits which would be super fast O(1) but not so memory efficient.
(Although I do have to implement this now, I'm really just asking out of curiosity, because my data is small enough to just go with the naïve approach.)
Any argument for why there doesn't seem to be a more efficient way is very welcome as well.
Edit: It should be noted that new elements may be appended between queries, so any cache has to grow with the vector, and both growing the vector and this filtered access should be fast.
For each index of the vector, store the preceding number of each fruit.
Then you can do a binary search to find the first index where the sum of the desired fruit counts is sufficient.
If you don't want to use that much memory, then store the counts in separate arrays, and only store them for every 16th index or so of the main array. Your binary search will then get you within 16 positions of the desired answer, and you can do a linear scan from there.
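A minimal C++ sketch of the full-prefix variant (`push`, `countIn`, and `getFruit` are illustrative names; the every-16th-index variant would store fewer rows and finish with a short linear scan):

```cpp
#include <array>
#include <set>
#include <vector>

enum Fruit { apple, orange, banana, potato, FRUIT_COUNT };

struct c { Fruit a; int d; };  // int stands in for the Data payload

std::vector<c> v;
// prefix[i][f] = how many of v[0..i-1] carry fruit f; it grows with v,
// so appending between queries stays cheap (copy one row, bump one count).
std::vector<std::array<int, FRUIT_COUNT>> prefix(1);

void push(c e) {
    v.push_back(e);
    std::array<int, FRUIT_COUNT> row = prefix.back();
    ++row[e.a];
    prefix.push_back(row);
}

// number of elements among v[0..i-1] whose fruit is in s
int countIn(const std::set<Fruit>& s, int i) {
    int total = 0;
    for (Fruit f : s) total += prefix[i][f];
    return total;
}

// nth (0-based) element whose fruit is in s, in O(|s| * log v.size());
// assumes at least n+1 matching elements exist.
c getFruit(const std::set<Fruit>& s, int n) {
    int lo = 0, hi = (int)v.size();
    while (lo < hi) {  // smallest i with countIn(s, i + 1) >= n + 1
        int mid = (lo + hi) / 2;
        if (countIn(s, mid + 1) >= n + 1) hi = mid;
        else lo = mid + 1;
    }
    return v[lo];
}
```

The count for an arbitrary fruit set is monotone in the index, which is what makes the binary search valid.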
I'm trying to make an application that is fed an algebra equation and solves for a given variable of the user's choosing.
Pseudocode below
enum Variable
x, pi, y, z; //.. etc
class Value
double constant;
Variable var;
class Term
Value val; // Might be a variable or a constant
Expression exponent; // The exponent of this term
boolean sign; // Negative flag
class Expression
LinkedList<Term> terms; // All the terms in this expression
^ This is what I need help on.
For example the average equation might be:
y = x + (x - 5)^z
^term ^term ^ operator ^ expression^term
I need to store this information in some sort of data structure, however, in order to parse through it. As you can see above, when I wrote LinkedList<Term> it works, but there's no way for me to represent operators.
Using the above example, this is how I want my data structure to look like:
// Left side of the equals sign
{ NULL <-> y <-> NULL }
// Right side of the equals sign
{ NULL <-> x <-> Operator.ADD <-> Expression: (x - 5) <-> NULL }
I can't do this, though, because a LinkedList needs to hold a single data type, which would need to be Expression. How should I represent operators?
It is significantly easier to work with expressions when you have them represented as abstract syntax trees, tree structures that show the underlying structures of formulas. I would strongly recommend investigating how to use ASTs here; you typically build them with a parsing algorithm (Dijkstra's shunting-yard algorithm might work really well for you based on your setup) and then use either abstract methods or the visitor pattern to traverse the ASTs to perform the computations you need.
ASTs are often represented by having either an interface or an abstract class representing a node in the tree, then having subclasses for each operator you'd encounter (they represent internal nodes) and subclasses for concepts like "number" or "variable" (typically, they're leaves).
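As a rough sketch of that structure, here in C++ for brevity (the names `Node`, `Num`, `Var`, and `BinOp` are my own; a Java version would use an interface plus subclasses the same way):

```cpp
#include <cmath>
#include <map>
#include <memory>
#include <string>

// Abstract node: everything in the tree can be evaluated against an
// environment mapping variable names to values.
struct Node {
    virtual ~Node() = default;
    virtual double eval(const std::map<std::string, double>& env) const = 0;
};

// Leaf: a numeric constant.
struct Num : Node {
    double value;
    explicit Num(double v) : value(v) {}
    double eval(const std::map<std::string, double>&) const override { return value; }
};

// Leaf: a variable, looked up in the environment.
struct Var : Node {
    std::string name;
    explicit Var(std::string n) : name(std::move(n)) {}
    double eval(const std::map<std::string, double>& env) const override {
        return env.at(name);
    }
};

// Internal node: a binary operator with two subtrees.
struct BinOp : Node {
    char op;  // '+', '-', '*', '^'
    std::unique_ptr<Node> lhs, rhs;
    BinOp(char o, std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    double eval(const std::map<std::string, double>& env) const override {
        double a = lhs->eval(env), b = rhs->eval(env);
        switch (op) {
            case '+': return a + b;
            case '-': return a - b;
            case '*': return a * b;
            default:  return std::pow(a, b);  // '^'
        }
    }
};
```

The right-hand side of y = x + (x - 5)^z then becomes a nested tree: BinOp('+', Var("x"), BinOp('^', BinOp('-', Var("x"), Num(5)), Var("z"))). Operators stop being a special case; they are just interior nodes.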
If you'd like to get a sense of what this might look like, I implemented a tool to generate truth tables for propositional logic formulas using these techniques. The JavaScript source shows off how to use ASTs and the shunting-yard algorithm.
What is the difference between using an array to store x, y, and z versus using an object (struct) that has x, y, and z coordinates as fields, when it comes to readability, speed, memory, and so on?
Any information is much appreciated!
Thanks,
Al
If you mean a struct like in C, it's laid out in memory much the same way as an array. In fact, if your struct had only int fields, you could cast a pointer to that struct to an int pointer and it would behave like an int array (assuming no padding). I would not recommend that, though; it's just an observation.
I don't see much benefit of using one over the other; just do what's easiest for you. I would prefer the struct, however, since the field names are more descriptive than array indices.
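To make the observation concrete, a small sketch (`Point` is an illustrative name); the static_assert documents the no-padding assumption rather than guaranteeing it portably:

```cpp
// Two ways to hold a 3-D coordinate:
struct Point { int x, y, z; };   // access by name: p.x
using Coords = int[3];           // access by index: c[0]

// With only int members there is normally no padding, so the two layouts
// usually coincide -- but the language does not guarantee this in general.
static_assert(sizeof(Point) == sizeof(Coords),
              "no padding here; do not rely on this in portable code");
```

Readability is the real difference: `p.x` tells the reader what it is, `c[0]` does not.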
I have a large genetic dataset (X, Y coordinates), of which I can easily know one dimension (X) during runtime.
I drafted the following code for a matrix class that allows specifying the size of one dimension while leaving the other one dynamic by using std::vector. Each vector is allocated with new and held by a unique_ptr, and those unique_ptrs are stored in a C-style array, itself allocated with new and held by a unique_ptr.
class Matrix
{
private:
typedef std::vector<Genotype> GenVec;
typedef std::unique_ptr<GenVec> upGenVec;
std::unique_ptr<upGenVec[]> m;
unsigned long size_;
public:
// ...
// construct
Matrix(unsigned long _size): m(new upGenVec[_size]), size_(_size)
{
for (unsigned long i = 0; i < this->size_; ++i)
this->m[i] = upGenVec(new GenVec);
}
};
My question:
Does it make sense to use this instead of std::vector<std::vector<Genotype>>?
My reasoning behind this implementation is that I only require one dimension to be dynamic, while the other should be fixed. Using std::vector for both could imply more memory allocation than needed. As I am working with data that would take up an estimated ~50 GB of RAM, I would like to control memory allocation as much as I can.
Or, are there better solutions?
I won't cite any paragraphs from the specification, but I'm pretty sure that std::vector's memory overhead is fixed, i.e. it doesn't depend on the number of elements it contains. So I'd say your solution with the C-style array is actually worse memory-wise, because what you allocate, excluding the actual data, is:
N * pointer_size (first dimension array)
N * vector_fixed_size (second dimension vectors)
In vector<vector<...>> solution what you allocate is:
1 * vector_fixed_size (first dimension vector)
N * vector_fixed_size (second dimension vectors)
Given a number of lists of items, find the lists with matching items.
The brute force pseudo-code for this problem looks like:
foreach list L
    foreach item I in list L
        foreach list L2 such that L2 != L
            foreach item I2 in L2
                if I == I2
                    return new 3-tuple(L, L2, I) //not important for the algorithm
I can think of a number of different ways of going about this - creating a list of lists and removing each candidate list after searching the others for example - but I'm wondering if there is a better algorithm for this?
I'm using Java, if that makes a difference to your implementation.
Thanks
Create a Map<Item,List<List>>.
Iterate through every item in every list.
Each time you touch an item, add the current list to that item's entry in the Map.
You now have a Map entry for each item that tells you what lists that item appears in.
This algorithm is roughly O(N), where N is the total number of items across all lists (the exact complexity will be affected by how good your Map implementation is). I believe your algorithm was at least O(N^2).
Caveat: I am comparing number of comparisons, not memory use. If your lists are super huge and full of mostly non duplicated items, the map that my method creates might become too big.
As per your comment you want a MultiMap implementation. A multimap is like a Map but it can map each key to multiple values. Store the value and a reference to all the maps that contain that value.
Map<Object, List>
Of course, you should use a type-safe key instead of Object, and a type-safe List as the value. What you are trying to do is build an inverted index.
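A minimal sketch of this inverted index, in C++ for brevity (`buildIndex` is an illustrative name; list indices stand in for references to the lists, and items are assumed unique within each list):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using Item = std::string;

// Map each item to the indices of the lists that contain it.
std::unordered_map<Item, std::vector<size_t>>
buildIndex(const std::vector<std::vector<Item>>& lists) {
    std::unordered_map<Item, std::vector<size_t>> index;
    for (size_t i = 0; i < lists.size(); ++i)
        for (const Item& item : lists[i])
            index[item].push_back(i);
    return index;
}
```

After one pass, any entry whose vector has more than one index is an item shared between lists.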
I'll start with the assumption that the datasets can fit in memory. If not, then you will need something fancier.
I refer below to a "set", where I am thinking of something like a C++ std::set. I don't know the Java equivalent, but any storage scheme that permits rapid lookup (tree, hash table, whatever) will do.
Comparing three lists: L0, L1 and L2.
Read L0, placing each element in a set: S0.
Read L1, placing items that match an element of S0 into a new set: S1, and discarding others.
Discard S0.
Read L2, keeping items that match an element of S1 and discarding others.
Update
Just realised that the question was for "n" lists, not three. However the extension should be obvious. (I hope)
Update 2
Some untested C++ code to illustrate the algorithm
#include <string>
#include <vector>
#include <set>
#include <cassert>
typedef std::vector<std::string> strlist_t;
strlist_t GetMatches(const std::vector<strlist_t>& vLists)
{
assert(vLists.size() > 1);
std::set<std::string> s0, s1;
std::set<std::string> *pOld = &s1;
std::set<std::string> *pNew = &s0;
// unconditionally load first list as "new"
s0.insert(vLists[0].begin(), vLists[0].end());
for (size_t i=1; i<vLists.size(); ++i)
{
//swap recently read "new" to "old" now for comparison with new list
std::swap(pOld, pNew);
pNew->clear();
// only keep new elements if they are matched in old list
for (size_t j=0; j<vLists[i].size(); ++j)
{
if (pOld->end() != pOld->find(vLists[i][j]))
{
// found match
pNew->insert(vLists[i][j]);
}
}
}
return strlist_t(pNew->begin(), pNew->end());
}
You can use a trie, modified to record what lists each node belongs to.
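A minimal sketch of that idea, assuming string items (`TrieNode`, `insert`, and `find` are illustrative names): each item from each list is inserted, and its terminal node accumulates the ids of the lists containing it, so terminal nodes with more than one id are the matches.

```cpp
#include <map>
#include <memory>
#include <set>
#include <string>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    std::set<int> lists;  // ids of lists whose item ends at this node
};

// Walk/extend the trie along the item's characters, then record the list id.
void insert(TrieNode& root, const std::string& item, int listId) {
    TrieNode* node = &root;
    for (char c : item) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->lists.insert(listId);
}

// Which lists contain this item? nullptr if the item was never inserted.
const std::set<int>* find(const TrieNode& root, const std::string& item) {
    const TrieNode* node = &root;
    for (char c : item) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return nullptr;
        node = it->second.get();
    }
    return &node->lists;
}
```

Compared to the hash-based inverted index, the trie also shares storage between items with common prefixes.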