If I have a large collection (40m+ objects) and I'm looking to find:
(a : 1 OR b : 2) && c : 3
sorted by d : 1
I understand that I would create an index:
{ "a":1, "b":1, "c":1, "d":1 }
But if I wanted to allow the sort order to be reversed, would I need an additional index:
{ "a":1, "b":1, "c":1, "d":-1 }
Many thanks.
If you want the results sorted by d in increasing or decreasing order, this compound index may not help. Remember that a compound index orders its entries lexicographically: by a, then b, then c, and only then by d. Entries read in index order are therefore not, in general, in increasing order on the last key (d).
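To see why lexicographic order does not imply order on the last key, here is a small Python sketch. It is only an illustration with made-up (a, b, c, d) tuples, not MongoDB code: plain tuple sorting stands in for walking the compound index.

```python
# Hypothetical documents as (a, b, c, d) tuples; every row has c == 3.
rows = [(0, 5, 3, 8), (0, 5, 3, 2), (1, 0, 3, 5), (1, 2, 3, 1)]

# Lexicographic order: compare a first, then b, then c, then d.
index_order = sorted(rows)
d_values = [r[3] for r in index_order]
print(d_values)  # d is not monotonic in this order

# Ordering on d alone needs an explicit sort, in either direction.
by_d_asc = sorted(rows, key=lambda r: r[3])
by_d_desc = sorted(rows, key=lambda r: r[3], reverse=True)
```

The d column comes out unsorted in index order, which is why the explicit sort is still needed.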
You could simply do the following:
db.foo.find(...).sort({d:1});
db.foo.find(...).sort({d:-1});
Sorting, even on large sets of tuples, is fairly fast.
I have a large list of some elements sorted by their probabilities:
data class Element(val value: String, val probability: Float)
val sortedElements = listOf(
    Element("dddcccdd", 0.7f),
    Element("aaaabb", 0.2f),
    Element("bbddee", 0.1f)
)
Now I need to perform prefix searches on this list: find the items that start with one prefix, then the items that start with the next prefix, and so on (the results still need to be sorted by probability).
val filteredElements1 = sortedElements
    .filter { it.value.startsWith("aa") }
val filteredElements2 = sortedElements
    .filter { it.value.startsWith("bb") }
Each "request" of elements filtered by some prefix takes O(n) time, which is too slow in case of a large list.
If I didn't care about the order of the elements (their probabilities), I could sort the elements lexicographically and perform a binary search: sorting takes O(n*log n) time and each request -- O(log n) time.
Is there any way to speed up the execution of these operations without losing the sorting (probability) of elements at the same time? Maybe there is some kind of special data structure that is suitable for this task?
You can read more about the trie data structure at https://en.wikipedia.org/wiki/Trie
This could be really useful for your use case.
LeetCode has another very detailed explanation of it, which you can find at https://leetcode.com/articles/implement-trie-prefix-tree/
Hope this helps
If your list does not change often, you could create a HashMap in which each existing prefix is a key that maps to a collection (sorted by probability) of all entries that start with that prefix.
Getting all entries for a given prefix then takes ~O(1).
Be careful: the map gets really big, and creating it takes quite some time.
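The prefix-map idea could look roughly like the following Python sketch. The name build_prefix_index is made up for illustration, and the sketch assumes the input list is already sorted by descending probability, so each per-prefix list inherits that order.

```python
from collections import defaultdict

def build_prefix_index(elements):
    """Map every prefix to the elements that start with it.

    `elements` is a list of (value, probability) pairs that is assumed
    to be sorted by descending probability already.
    """
    index = defaultdict(list)
    for value, probability in elements:
        # Register the element under each of its non-empty prefixes.
        for end in range(1, len(value) + 1):
            index[value[:end]].append((value, probability))
    return index

sorted_elements = [("dddcccdd", 0.7), ("aaaabb", 0.2), ("bbddee", 0.1)]
index = build_prefix_index(sorted_elements)

# .get avoids creating empty entries for unknown prefixes.
print(index.get("aa", []))
print(index.get("zz", []))
```

A string of length k is stored under k prefixes, so memory grows with the total length of all strings. That is the size concern mentioned above.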
We're learning about hash tables in my data structures and algorithms class, and I'm having trouble understanding separate chaining.
I know the basic premise: each bucket has a pointer to a Node that contains a key-value pair, and each Node contains a pointer to the next (potential) Node in the current bucket's mini linked list. This is mainly used to handle collisions.
Now, suppose for simplicity that the hash table has 5 buckets. Suppose I wrote the following lines of code in my main after creating an appropriate hash table instance.
myHashTable["rick"] = "Rick Sanchez";
myHashTable["morty"] = "Morty Smith";
Let's imagine whatever hashing function we're using just so happens to produce the same bucket index for both string keys rick and morty. Let's say that bucket index is index 0, for simplicity.
So at index 0 in our hash table, we have two nodes with values of Rick Sanchez and Morty Smith, in whatever order we decide to put them in (the first pointing to the second).
When I want to display the corresponding value for rick, which is Rick Sanchez per our code here, the hashing function will produce the bucket index of 0.
How do I decide which node needs to be returned? Do I loop through the nodes until I find the one whose key matches rick?
To resolve a hash table collision, that is, to put or get an item whose hash value collides with another item's, you fall back on the data structure that backs each bucket; this is generally a linked list. A collision is the worst case for a hash table: you end up with an O(n) operation to reach the correct item in the linked list. That is, a loop, as you said, that searches for the item with the matching key. But if the bucket is backed by a structure such as a balanced tree, the search can take O(log n) time, as in the Java 8 implementation.
As JEP 180: Handle Frequent HashMap Collisions with Balanced Trees says:
The principal idea is that once the number of items in a hash bucket
grows beyond a certain threshold, that bucket will switch from using a
linked list of entries to a balanced tree. In the case of high hash
collisions, this will improve worst-case performance from O(n) to
O(log n).
This technique has already been implemented in the latest version of
the java.util.concurrent.ConcurrentHashMap class, which is also slated
for inclusion in JDK 8 as part of JEP 155. Portions of that code will
be re-used to implement the same idea in the HashMap and LinkedHashMap
classes.
I strongly suggest always looking at an existing implementation; for instance, you could look at the Java 7 implementation. That will improve your code-reading skills, which matter almost as much as writing code and which you exercise more often. I know that it takes more effort, but it will pay off.
For example, take a look at the Hashtable.get method from Java 7:
public synchronized V get(Object key) {
    Entry<?,?> tab[] = table;
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % tab.length;
    for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
        if ((e.hash == hash) && e.key.equals(key)) {
            return (V)e.value;
        }
    }
    return null;
}
Here we see that if ((e.hash == hash) && e.key.equals(key)) is trying to find the correct item with the matching key.
And here is the full source code: Hashtable.java
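The same lookup loop can be sketched in a few lines of Python. This is a simplified illustration, not how any particular library does it: the class name ChainedHashTable is made up, and each bucket's chain is a plain Python list rather than a hand-written linked list. A bucket count of 1 forces rick and morty into the same bucket.

```python
class ChainedHashTable:
    def __init__(self, bucket_count=5):
        # One chain (here: a list of (key, value) pairs) per bucket.
        self._buckets = [[] for _ in range(bucket_count)]

    def _index(self, key):
        return hash(key) % len(self._buckets)

    def put(self, key, value):
        bucket = self._buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        # Walk the chain and compare keys until one matches.
        for existing_key, value in self._buckets[self._index(key)]:
            if existing_key == key:
                return value
        return None

table = ChainedHashTable(bucket_count=1)  # every key collides
table.put("rick", "Rick Sanchez")
table.put("morty", "Morty Smith")
print(table.get("rick"))
```

Both entries live in bucket 0, and get tells them apart only by comparing the stored key with the requested one.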
I asked a question that was basically a knapsack problem: I needed to find the combination of objects, one from each of several different arrays, that gave the optimal output. For example, the highest summed "value" of the objects subject to a limit on their total "cost". The answer I received here was the following:
a.product(b,c)
.select{ |arr| arr.reduce(0) { |sum,h| sum + h[:cost] } < 30 }
.max_by{ |arr| arr.reduce(0) { |sum,h| sum + h[:value] } }
Which works great, but as I get into 6 arrays with ~40 choices each, the possible combinations get upwards of 4 million and take too long to process. I made some changes to the code that made processing faster -
# creating the array doesn't take too long
combinations = a.product(b,c,d,e)
possibles = []
combinations.each do |array_of_objects|
  # max_cost is a numeric parameter, and I can't have the same exact object used twice,
  # so a combination must satisfy both conditions (&&, not or)
  if array_of_objects.sum(&:salary) <= max_cost && array_of_objects.uniq.count == array_of_objects.count
    possibles << array_of_objects
  end
end
possibles.max_by{ |ar| ar.sum(&:std_proj) }
Breaking it into two separate arrays helped the performance a lot, as I only had to run max_by over the far fewer combinations that fit the criteria.
Does anyone see a way to optimize this code? Since I'm typically dealing with tens of thousands or millions of combinations, any little bit could greatly help. Thanks.
If we are talking about millions of rows and the operations are things like unique and max, I suggest solving it in the database: use DISTINCT and MAX() in your query, and you can even filter by cost with WHERE.
Looping over the objects in Ruby is clearly more expensive.
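As a rough illustration of pushing the filter and the maximum into a query, here is a Python sketch using the standard sqlite3 module. It assumes the choices live in database tables, which the question does not state. The table names a, b, c, the columns id, cost, value, and the sample rows are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical data: (id, cost, value) rows for three groups of choices.
groups = {
    "a": [(1, 10, 5), (2, 20, 9)],
    "b": [(1, 5, 4), (2, 15, 8)],
    "c": [(1, 4, 3), (2, 12, 7)],
}
for name, rows in groups.items():
    conn.execute(f"CREATE TABLE {name} (id INTEGER, cost INTEGER, value INTEGER)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)

# WHERE drops combinations over the cost limit; MAX() picks the best total.
best_value = conn.execute(
    """
    SELECT MAX(a.value + b.value + c.value)
    FROM a CROSS JOIN b CROSS JOIN c
    WHERE a.cost + b.cost + c.cost < 30
    """
).fetchone()[0]
print(best_value)
```

The database still has to consider the cross product, so this moves the work out of Ruby rather than removing it.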
Why can't I use table.sort to sort tables with associative indexes?
In general, Lua tables are pure associative arrays. There is no "natural" order other than as a side effect of the particular hash table implementation used in the Lua core. This makes sense because values of any Lua data type (other than nil) can be used as both keys and values; but only strings and numbers have any kind of sensible ordering, and then only between values of like type.
For example, what should the sorted order of this table be:
unsortable = {
    answer=42,
    [true]="Beauty",
    [function() return 17 end] = function() return 42 end,
    [math.pi] = "pi",
    [ {} ] = {},
    12, 11, 10, 9, 8
}
It has one string key, one boolean key, one function key, one non-integral key, one table key, and five integer keys. Should the function sort ahead of the string? How do you compare the string to a number? Where should the table sort? And what about userdata and thread values which don't happen to appear in this table?
By convention, values indexed by sequential integers beginning with 1 are commonly used as lists. Several functions and common idioms follow this convention, and table.sort is one example. Functions that operate over lists usually ignore any values stored at keys that are not part of the list. Again, table.sort is an example: it sorts only those elements that are stored at keys that are part of the list.
Another example is the # operator. For the above table, #unsortable is 5 because unsortable[5] ~= nil and unsortable[6] == nil. Notice that the value stored at the numeric index math.pi is not counted even though pi is between 3 and 4 because it is not an integer. Furthermore, none of the other non-integer keys are counted either. This means that a simple for loop can iterate over the entire list:
for i = 1,#unsortable do
    print(i,unsortable[i])
end
Although that is often written as
for i,v in ipairs(unsortable) do
    print(i,v)
end
In short, Lua tables are unordered collections of values, each indexed by a key; but there is a special convention for sequential integer keys beginning at 1.
Edit: For the special case of non-integral keys with a suitable partial ordering, there is a work-around involving a separate index table. The described content of tables keyed by string values is a suitable example for this trick.
First, collect the keys in a new table, in the form of a list. That is, make a table indexed by consecutive integers beginning at 1 with keys as values and sort that. Then, use that index to iterate over the original table in the desired order.
For example, here is foreachinorder(), which uses this technique to iterate over all values of a table, calling a function for each key/value pair, in an order determined by a comparison function.
function foreachinorder(t, f, cmp)
    -- first extract a list of the keys from t
    local keys = {}
    for k,_ in pairs(t) do
        keys[#keys+1] = k
    end
    -- sort the keys according to the function cmp. If cmp
    -- is omitted, table.sort() defaults to the < operator
    table.sort(keys,cmp)
    -- finally, loop over the keys in sorted order, and operate
    -- on elements of t
    for _,k in ipairs(keys) do
        f(k,t[k])
    end
end
It constructs an index, sorts it with table.sort(), then loops over each element in the sorted index and calls the function f for each one. The function f is passed the key and value. The sort order is determined by an optional comparison function which is passed to table.sort. It is called with two elements to compare (the keys to the table t in this case) and must return true if the first is less than the second. If omitted, table.sort uses the built-in < operator.
For example, given the following table:
t1 = {
    a = 1,
    b = 2,
    c = 3,
}
then foreachinorder(t1,print) prints:
a 1
b 2
c 3
and foreachinorder(t1,print,function(a,b) return a>b end) prints:
c 3
b 2
a 1
You can only sort tables with consecutive integer keys starting at 1, i.e., lists. If you have another table of key-value pairs, you can make a list of pairs and sort that:
function sortpairs(t, lt)
    local u = { }
    for k, v in pairs(t) do table.insert(u, { key = k, value = v }) end
    table.sort(u, lt)
    return u
end
Of course this is useful only if you provide a custom ordering (lt) which expects as arguments key/value pairs.
This issue is discussed at greater length in a related question about sorting Lua tables.
Because they don't have any order in the first place. It's like trying to sort a garbage bag full of bananas.
I have a large number of fixed-size records. Each record has lots of fields; ID and Value are among them. I am wondering what kind of data structure would be best so that I can
locate a record by ID(unique) very fast,
list the 100 records with the biggest values.
A max-heap seems to work, but it is far from perfect; do you have a smarter solution?
Thank you.
A hybrid data structure will most likely be best. For efficient lookup by ID a good structure is obviously a hash-table. To support top-100 iteration a max-heap or a binary tree is a good fit. When inserting and deleting you just do the operation on both structures. If the 100 for the iteration case is fixed, iteration happens often and insertions/deletions aren't heavily skewed to the top-100, just keep the top 100 as a sorted array with an overflow to a max-heap. That won't modify the big-O complexity of the structure, but it will give a really good constant factor speed-up for the iteration case.
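I know you want a pseudo-code algorithm, but in Java, for example, I would use a TreeSet and add all the records as (ID, value) pairs, with a comparator that orders them by value.
The tree keeps them sorted by value, so reading 100 entries from the high end (for example with descendingIterator()) gives you the top 100. Retrieving by ID is straightforward only if you also keep a map keyed by ID, since the tree itself is ordered by value.
The underlying structure is a balanced binary search tree (TreeSet is backed by a red-black tree).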
Max heap would match the second requirement, but hash maps or balanced search trees would be better for the first one. Make the choice based on frequency of these operations. How often would you need to locate a single item by ID and how often would you need to retrieve top 100 items?
Pseudo code:
add(Item t)
{
    //Add the same object instance to both data structures
    heap.add(t);
    hash.add(t);
}
remove(int id)
{
    heap.removeItemWithId(id);//this is gonna be slow
    hash.remove(id);
}
getTopN(int n)
{
    return heap.topNitems(n);
}
getItemById(int id)
{
    return hash.getItemById(id);
}
updateValue(int id, String value)
{
    Item t = hash.getItemById(id);
    //now t is the same object referred to by the heap and hash
    t.value = value;
    //the hash needs no change, but the heap must restore its ordering around t
    heap.restoreOrder(t);//hypothetical helper: sift t up or down
}
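A runnable Python sketch of the hash-plus-heap idea follows. It is a simplified variant under stated assumptions, not the exact structure described above. The class name RecordStore is made up, a dict provides the lookup by ID, and heapq.nlargest selects the top N with a bounded heap on each call, which costs O(n log N) per query. There is no persistent heap to keep in sync, so removals and updates stay cheap; this trade-off only suits workloads where top-N queries are comparatively rare.

```python
import heapq

class RecordStore:
    def __init__(self):
        self._by_id = {}  # id -> value, gives O(1) average lookup by ID

    def add(self, rec_id, value):
        # Adding an existing ID simply updates its value.
        self._by_id[rec_id] = value

    def remove(self, rec_id):
        self._by_id.pop(rec_id, None)

    def get(self, rec_id):
        return self._by_id.get(rec_id)

    def top_n(self, n):
        # Heap-based selection of the n largest values, largest first.
        return heapq.nlargest(n, self._by_id.items(), key=lambda item: item[1])

store = RecordStore()
for rec_id, value in [(1, 50), (2, 75), (3, 10), (4, 99)]:
    store.add(rec_id, value)

print(store.get(2))
print(store.top_n(2))
```

If top-N queries dominate, maintaining a persistent heap or a sorted top-100 array, as suggested above, avoids rescanning all records on each call.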