I would like to quickly retrieve the median value from a Boost multi_index container with an ordered_unique index; however, that index's iterators aren't random access (I don't understand why they can't be, though this is consistent with std::set...).
Is there a faster/neater way to do this other than incrementing an iterator container.size() / 2 times?
Boost.MultiIndex provides random-access indices, but those indices don't maintain any ordering by themselves. You can, however, sort such an index with its sort member function after inserting new elements, and then you will be able to get the median efficiently.
It might be worth filing a request with Boost.MultiIndex so that insertion into a random-access index can be done in sorted order directly, as that would be much more efficient.
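For illustration, here is a minimal sketch of that approach (the container layout is my assumption, not taken from the question): one ordered_unique view for normal use and one random_access view that gets sorted before reading the median with operator[].

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/random_access_index.hpp>
#include <boost/multi_index/identity.hpp>
#include <iostream>

namespace bmi = boost::multi_index;

// ordered_unique view for normal use, random_access view for the median
using Container = bmi::multi_index_container<
    int,
    bmi::indexed_by<
        bmi::ordered_unique<bmi::identity<int>>,
        bmi::random_access<>
    >
>;

int main() {
    Container c;
    for (int v : {5, 1, 9, 3, 7}) c.insert(v);

    auto& ra = c.get<1>();                    // the random-access index
    ra.sort();                                // sort this view by value (std::less by default)
    std::cout << ra[ra.size() / 2] << '\n';   // O(1) access to the median: prints 5
}

The sort itself is O(n log n), so in practice you would only re-sort when you actually need the median.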
I ran into the same problem in a different context. It seems that the STL and Boost don't provide an ordered container that has random access to make use of the ordering (e.g. for comparing).
My (not so pretty) solution was to use a class that performed the input and "filtered" it into a set. After the input operation was finished, it simply copied all of the set's iterators into a vector and used that for random access.
This solution only works in a very limited context: you perform input on the container once. If you add to the container again, all the iterators have to be copied again. It really was very clumsy to use, but it worked.
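A rough sketch of what that looked like (the names and types here are mine, not from the original code): filter the input through a std::set, then copy the set's iterators into a vector once the input is finished.

#include <iostream>
#include <set>
#include <vector>

int main() {
    std::set<int> ordered = {5, 1, 9, 3, 7};                  // the "filtered" input

    std::vector<std::set<int>::iterator> index;               // random-access view
    index.reserve(ordered.size());
    for (auto it = ordered.begin(); it != ordered.end(); ++it)
        index.push_back(it);                                   // copy iterators once

    std::cout << *index[index.size() / 2] << '\n';             // median in O(1): prints 5
    // As noted above: if the set is modified, the vector has to be rebuilt.
}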
So, I'm working in an environment where pointers are non-existent (or at least, inaccessible), and I'm trying to efficiently implement a stack. I have a stack implementation working, but it's O(n), which of course isn't as efficient as the usual O(1) you get with pointer-based stacks. I just can't figure out a better way to implement this.
Some important background of the limitations of this environment: there's a global array of instances of a class called Entity; variables can only store signed integers; and there's no method of using pointers or even creating new arrays. Super limited.
Entities have members for (x,y,z) coordinates, a map of strings to integers for arbitrary data storage (of integers, at least), and a list of strings for arbitrary string storage. The environment provides no way of comparing two strings, except by comparing them to hard-coded values, and it provides no native way of comparing two integers, unless one is hard-coded; so to compare two variable integers, you have to subtract them and compare to 0 (very Assembly-like in that regard).
The implementation I have now adds a new Entity instance to the list for each entry in the stack, storing its value and index in its map with the keys Value and Index (I know, original). Whenever a value is pushed onto the stack, I iterate through the list and increment the Index of each existing Entity, then create a new Entity with an Index of 0. When it's popped, I iterate through the list, find the one with Index=0, and copy that value; I decrement the Index of every non-zero Entity I find on that list.
It works perfectly, but of course that's O(n) for both pushing and popping. Even if I were to track the head Index somewhere, the only way to find the entry with the matching Index would be to subtract the head Index from all the entries first, which is still O(n).
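Here is roughly what that scheme looks like, modelled in C++ purely for clarity (the real environment has no such types; Entity, push and pop here are just stand-ins): both operations walk the whole list, which is where the O(n) comes from.

#include <list>
#include <map>
#include <optional>
#include <string>

struct Entity {
    std::map<std::string, int> data;               // stands in for the string->int map
};

std::list<Entity> entities;                        // stands in for the global entity list

void push(int value) {
    for (auto& e : entities) ++e.data["Index"];    // renumber every existing entry: O(n)
    entities.push_back({{{"Value", value}, {"Index", 0}}});
}

std::optional<int> pop() {
    std::optional<int> result;
    for (auto it = entities.begin(); it != entities.end(); ) {
        if (it->data["Index"] == 0) {              // the head of the stack
            result = it->data["Value"];
            it = entities.erase(it);
        } else {
            --it->data["Index"];                   // shift everything else down: O(n)
            ++it;
        }
    }
    return result;                                 // empty if the stack was empty
}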
Is there any way to do this more efficiently than O(n) without access to pointers or even additional arrays? Or is this the best that can be done with these restrictions?
I have algorithms that work with dynamically growing lists (contiguous memory, like a C++ vector, Java ArrayList or C# List). Until recently, these algorithms would insert new values into the middle of the lists. Of course, this was usually a very slow operation: every time an item was added, all the items after it needed to be shifted to a higher index. Do this a few times for each algorithm and things get really slow.
My realization was that I could add the new items to the end of the list and then rotate them into position later. That's one option!
Another option, when I know how many items I'm adding ahead of time, is to add that many items to the back, shift the existing items over, and then perform the algorithm in-place in the hole I've made for myself. The downside is that I have to add default values to the end of the list and then just overwrite them.
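For concreteness, here is a sketch of both options in C++ terms (the vector, pos and values are hypothetical; the original lists may well be Java or C#):

#include <algorithm>
#include <cstddef>
#include <vector>

// Option 1: append the new values at the end, then rotate them into position.
void insert_by_rotate(std::vector<int>& v, std::size_t pos, const std::vector<int>& values) {
    std::size_t old_size = v.size();
    v.insert(v.end(), values.begin(), values.end());           // grow at the back
    std::rotate(v.begin() + pos, v.begin() + old_size, v.end());
}

// Option 2: grow once with default values, shift the tail, then fill the hole in place.
void insert_into_hole(std::vector<int>& v, std::size_t pos, const std::vector<int>& values) {
    std::size_t old_size = v.size();
    v.resize(old_size + values.size());                         // the throw-away defaults appear here
    std::move_backward(v.begin() + pos, v.begin() + old_size, v.end());
    std::copy(values.begin(), values.end(), v.begin() + pos);   // overwrite the hole
}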
I did a quick analysis of these options and concluded that the second option is more efficient. My reasoning was that the rotation with the first option would result in in-place swaps (requiring a temporary). My only concern with the second option is that I am creating a bunch of default values that just get thrown away. Most of the time, these default values will be null or a mem-filled value type.
However, I'd like someone else familiar with algorithms to tell me which approach would be faster. Or, perhaps there's an even more efficient solution I haven't considered.
Arrays aren't efficient for lots of insertions or deletions into anywhere other than the end of the array. Consider whether using a different data structure (such as one suggested in one of the other answers) may be more efficient. Without knowing the problem you're trying to solve, it's near-impossible to suggest a data structure (there's no one solution for all problems). That being said...
The second option is definitely the better option of the two. A somewhat better option (avoiding the default-value issue): simply copy 789 to the end and overwrite the middle 789 with 456. So the only intermediate step would be 0123789789.
Your default-value concern is, however, (generally) not a big issue:
In Java, for one, you cannot (to my knowledge) even allocate memory for an array that isn't 0- or null-filled. The C++ standard-library containers also enforce this, I believe (though raw C++ arrays do not).
The size of a pointer compared to any moderately sized class is minimal, so assigning it a default value also takes minimal time. In Java and C# everything is a reference; in C++ you can use pointers (something like boost::shared_ptr or a pointer vector is preferable to raw pointers). None of this applies to primitives, but those are small to start with, so they are generally not a big issue either.
I'd also suggest forcing a reallocation to a specified size before you start inserting to the end of the array (Java's ArrayList::ensureCapacity or C++'s vector::reserve). In case you didn't know - varying-length-array implementations tend to have an internal array that's bigger than what size() returns or what's accessible (in order to prevent constant reallocation of memory as you insert or delete values).
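A small illustration of that advice, assuming a std::vector (Java's ArrayList.ensureCapacity plays the same role):

#include <vector>

void bulk_append(std::vector<int>& v, const std::vector<int>& extra) {
    v.reserve(v.size() + extra.size());        // at most one reallocation, up front
    for (int x : extra)
        v.push_back(x);                        // no further reallocations while appending
}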
Also note that there are more efficient methods to copy parts of an array than doing it manually with for loops (e.g. Java's System.arraycopy).
You might want to consider changing your representation of the list from using a dynamic array to using some other structure. Here are two options that allow you to implement these operations efficiently:
An order statistic tree is a modified type of binary tree that supports insertions and selections anywhere in O(log n) time, as well as lookups in O(log n) time. This will increase your memory usage quite a bit because of the overhead for the pointers and extra bookkeeping, but should dramatically speed up insertions. However, it will slow down lookups a bit.
If you always know the insertion point in advance, you could consider switching to a linked list instead of an array, and just keep a pointer to the linked list cell where insertions will occur. However, this slows down random access to O(n), which could possibly be an issue in your setup.
Alternatively, if you always know where insertions will happen, you could consider representing your array as two stacks - one stack holding the contents of the array to the left of the insert point and one holding the (reverse) of the elements to the right of the insertion point. This makes insertions fast, and if you have the right type of stack implementation could keep random access fast.
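A rough sketch of that two-stack idea (essentially a gap buffer; the names here are made up): left holds everything before the insertion point and right holds everything after it, stored in reverse.

#include <cstddef>
#include <vector>

struct GapList {
    std::vector<int> left;                      // elements before the insertion point
    std::vector<int> right;                     // elements after it, in reverse order

    void insert(int value) { left.push_back(value); }   // O(1) insert at the gap

    void move_gap_right() {                     // move the insertion point one step right
        left.push_back(right.back());
        right.pop_back();
    }

    std::size_t size() const { return left.size() + right.size(); }

    int& operator[](std::size_t i) {            // random access stays O(1)
        return i < left.size()
            ? left[i]
            : right[right.size() - 1 - (i - left.size())];
    }
};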
Hope this helps!
HashMaps and linked lists were designed for the kind of problem you are having. Given an indexed data structure with numbered items, the difficulty with inserting items in the middle is that every item after the insertion point has to be renumbered.
You need a data structure that is optimized to make inserts constant, i.e. O(1). HashMaps were designed to make insert and delete operations lightning quick regardless of dataset size.
I can't pretend to do the HashMap subject justice by describing it. Here is a good intro: http://en.wikipedia.org/wiki/Hash_table
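To make the linked-list point concrete, here is a minimal sketch: finding the position is still linear, but the insertion itself does no renumbering of the other items.

#include <iterator>
#include <list>

int main() {
    std::list<int> xs = {0, 1, 2, 3, 7, 8, 9};

    auto pos = std::next(xs.begin(), 4);   // walking to the position is O(n)...
    xs.insert(pos, {4, 5, 6});             // ...but each insertion at that position is O(1)
}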
I have a mapping of String id -> Object. Apart from merely having to insert into and delete from this map, I also need to find the id with the lowest x-value (the x-value is a member of the class from which the Object is instantiated).
Initially I thought I could just create another mapping, x-value -> String id, for this. But that does not help much, because for a remove operation I would now have to search this second map for a particular id anyway (so we are back to the main problem itself).
Any suggestions to do this efficiently? (time wise - memory is not a big constraint)
EDIT: I think I could just get the x-value from the id (for the removal function) and remove from the second map using that x-value. Another thing here: the x-value is a float. Is it a good idea to use a float as a key in a map? Maybe using fabs and a precision value could do the trick for the floating-point comparisons?
EDIT #2: Unfortunately I remembered why the above method might not work (I was busy with other stuff and forgot about this project for a while). The x-value for different map entries NEED NOT BE UNIQUE. String ID is the primary key. So I need to use a multimap and use equal_range.
Your solution of using an auxiliary map isn't as bad as your post suggests.
It is true that a removal operation would require a lookup in the second map. However, this lookup can be done in O(log n) time. This is unlikely to be a deal breaker. If it is, please post more details.
How often do you remove objects? Usually in cases like this you have to think about the frequency of operations too. If removal is done infrequently, then your solution with the second map could be quite good.
If you use a tree map for the second mapping, you will immediately have the minimum element, and it will take O(log n) to remove an element from it.
One other alternative is to use a priority queue backed by a doubly linked list to find the minimal element, and to remember in the first map a direct reference to the element's node. That node can then be used for removal.
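Here is a sketch of the two-map approach with an ordered multimap as the secondary index, which handles the non-unique x-values from the question (Object and its x member are placeholders for the real class):

#include <map>
#include <string>

struct Object { float x = 0.0f; /* ... */ };

std::map<std::string, Object> byId;        // primary: id -> Object
std::multimap<float, std::string> byX;     // secondary: x -> id (x need not be unique)

void insert(const std::string& id, const Object& o) {
    byId[id] = o;
    byX.emplace(o.x, id);
}

void remove(const std::string& id) {
    auto it = byId.find(id);
    if (it == byId.end()) return;
    auto range = byX.equal_range(it->second.x);          // O(log n) plus entries with equal x
    for (auto r = range.first; r != range.second; ++r)
        if (r->second == id) { byX.erase(r); break; }
    byId.erase(it);
}

const std::string& idWithLowestX() { return byX.begin()->second; }   // assumes non-empty maps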
I am dealing with hundreds of thousands of files,
I have to process those files 1-by-1,
In doing so, I need to remember the files that are already processed.
All I can think of is storing the file path of each file in a lo----ong array, and then checking it every time for duplication.
But, I think that there should be some better way,
Is it possible for me to generate a KEY (which is a number) or something, that just remembers all the files that have been processed?
You could use some kind of hash function (MD5, SHA1).
Pseudocode:
for each F in filelist
    hash = md5(F name)
    if not hash in storage
        process file F
        store hash in storage to remember
see https://www.rfc-editor.org/rfc/rfc1321 for a C implementation of MD5
There are probabilistic methods that give approximate results, but if you want to know for sure whether a string is one you've seen before or not, you must store all the strings you've seen so far, or equivalent information. It's a pigeonhole principle argument. Of course you can get by without doing a linear search of the strings you've seen so far using all sorts of different methods like hash tables, binary trees, etc.
If I understand your question correctly, you want to create a SINGLE key that should take on a specific value, and from that value you should be able to deduce which files have already been processed? I don't know if you'll be able to do that, simply because your space is quite big and generating unique key representations in such a huge space requires a lot of memory.
As mentioned, what you can do is simply store each file path in a HashSet. Putting a hundred thousand entries into the Set is not that bad, and lookup time is amortized constant, O(1), so it will be quite fast.
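A minimal sketch of that idea in C++, using std::unordered_set (the rough equivalent of Java's HashSet); processFile is a stand-in for whatever the real work is:

#include <string>
#include <unordered_set>
#include <vector>

void processFile(const std::string& path) { /* ... the actual work ... */ }

void processAll(const std::vector<std::string>& files) {
    std::unordered_set<std::string> seen;
    for (const auto& path : files) {
        if (seen.insert(path).second)      // insert() reports whether the path was new
            processFile(path);             // only process paths we haven't seen before
    }
}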
A Bloom filter can solve your problem.
The idea of a Bloom filter is simple. It begins with an empty bit array of some length, with every member set to zero, and K hash functions.
Whenever we need to insert an item into the Bloom filter, we hash the item with all K hash functions. These hash functions give us K indexes into the bit array, and we set the value at each of those indexes to 1.
To check whether an item exists in the Bloom filter, simply hash it with all K hash functions and check the corresponding array indexes. If all of them are 1, the item is (probably) present in the Bloom filter.
Kindly note that a Bloom filter can give false positive results, but it will never give false negatives. You need to tune the Bloom filter's parameters (array length and number of hash functions) to keep the false-positive rate acceptable.
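A toy Bloom filter along those lines (the array size and the two hash functions are arbitrary choices for illustration, not tuned values):

#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

class BloomFilter {
    static constexpr std::size_t M = 1 << 20;            // length of the bit array
    std::bitset<M> bits;

    // two cheap hash functions derived from std::hash with different salts
    std::size_t h1(const std::string& s) const { return std::hash<std::string>{}(s) % M; }
    std::size_t h2(const std::string& s) const { return std::hash<std::string>{}(s + "#") % M; }

public:
    void insert(const std::string& s) { bits.set(h1(s)); bits.set(h2(s)); }

    // true means "probably seen before"; false means "definitely not seen"
    bool mightContain(const std::string& s) const { return bits.test(h1(s)) && bits.test(h2(s)); }
};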
What you need, IMHO, is some sort of tree- or hash-based set implementation. It is basically a data structure that supports very fast add, remove and query operations and keeps only one instance of each element (i.e. no duplicates). A few hundred thousand strings (assuming they are not themselves hundreds of thousands of characters long) should not be a problem for such a data structure.
Your programming language of choice probably already has one, so you don't need to write it yourself. C++ has std::set, Java has the Set implementations TreeSet and HashSet, and Python has set. They all allow you to add elements and check for the presence of an element very fast (O(1) for hash-table-based sets, O(log n) for tree-based sets). Beyond those, there are lots of free implementations of sets, as well as general-purpose binary search trees and hash tables, that you can use.
I have an array of integers, which could run into the hundreds of thousands (or more), sorted numerically ascending since that's how they were originally stacked.
I need to be able to query the array to get the index of its first occurrence of a number >= some input, as efficiently as possible. The only way I would know how to do this without even thinking about it would be to iterate through the array testing the condition until it returns true, at which point I'd stop iterating. However, this is the most expensive solution to this problem and I'm looking for the best algorithm to solve it.
I'm coding in Objective-C, but I'll give an example in JavaScript to broaden the audience of people who are able to respond.
// Sample set
var numbers = [1, 7, 23, 23, 23, 89, 1002, 1003];
var indexAfter100 = getIndexOfValueGreaterThan(100);
var indexAfter7 = getIndexOfValueGreaterThan(7);
// (indexAfter100 == 6) == true
// (indexAfter7 == 2) == true
Putting this data into a DB in order to perform this search will only be a last-resort solution since I'm keen to see some sort of algorithm to tackle this quickly in memory.
I do have the ability to change the data structure, or to store an additional data structure as I'm building the array, since my program has already pushed each number one by one onto this stack, so I'd just modify the code that's adding them to the stack. Searching for the index as they're being added to the stack isn't possible since the search operation will be repeated frequently with different values after the fact.
Right now I'm thinking "B-Tree" but to be honest, I would have no idea how to implement one and before I go off and start figuring that out, I wonder if there's a nice algorithm that fits this single use-case better?
You should use binary search. Objective C could even have a built-in method for that (many languages I know do). B-tree won't probably help much, unless you want to store the data on disk.
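In C++ terms (shown only because the thread already mixes languages), std::lower_bound does exactly this search in O(log n); upper_bound covers the strictly-greater case from the question's second example:

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> numbers = {1, 7, 23, 23, 23, 89, 1002, 1003};

    // first index whose value is >= 100
    auto indexAfter100 = std::lower_bound(numbers.begin(), numbers.end(), 100) - numbers.begin();   // 6

    // first index whose value is > 7 (strictly greater, as in the second example)
    auto indexAfter7 = std::upper_bound(numbers.begin(), numbers.end(), 7) - numbers.begin();       // 2
}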
I don't know about Objective-C, but C (plain 'ol C) comes with a function called bsearch (besides, AFAIK, Obj-C can call C functions just fine):
http://www.cplusplus.com/reference/clibrary/cstdlib/bsearch/
That basically does a binary search which sounds like it's what you need.
A fast search algorithm should be able to handle an array of ints of that size without taking too long, I should think (and the array is sorted, so a binary search would probably be the way to go).
I think a btree is probably overkill...
Since they are sorted in ascending order and you only need the bigger ones, I would serialize that array, explode it by the INT, keep the part of the serialized string that holds the bigger INTs, then unserialize it, and voilà.
Linear search, also referred to as sequential search, looks at each element in sequence from the start to see if the desired element is present in the data structure. When the amount of data is small, this search is fast. It's easy, but the work needed is in proportion to the amount of data to be searched: doubling the number of elements will double the time to search if the desired element is not present.
Binary search is efficient for larger arrays. Here we check the middle element: if the value is bigger than what we are looking for, we look in the first half; otherwise, we look in the second half. We repeat this until the desired item is found. The table must be sorted for binary search. It eliminates half the data at each iteration, so it's logarithmic.