I want to make an infinite tiled map covering (-max_int, -max_int) to (max_int, max_int), so I'm going to use a basic structure: a chunk. Each chunk contains char tiles[w][h] plus its int x, y coordinates. For example, with h = w = 10, tile(15,5) is in chunk(1,0) at (5,5), tile(-25,-17) is in chunk(-3,-2) at (5,3), and so on. There can be any number of chunks, and I need to store them with lookup in O(log n) or better (O(1) if possible... but it probably isn't). It should be easy to add, find and, ideally (not a must), remove chunks. So what data structure should I use?
Read up on KD-trees or quadtrees (the 2D variant of the octree). Either of these might be a big help here.
So all your space is split into chunks (rectangular clusters). The general problem is storing data in a sparse matrix (since the clustering is already handled). Why not use two-level dictionary-like containers? I.e. an rb-tree keyed by row index, where each value is an rb-tree keyed by column index. Or, if you are lucky, you can use hashes to get your O(1). In either case, if you can't find the row, you allocate it in the outer container and create a new inner container as its value, initially holding only the single chunk. Of course, allocating a new chunk on an existing row will be a bit faster than on a new one, and I think that's the only drawback of this approach.
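A minimal sketch of the two-level idea in C++, assuming a Chunk struct like the one described in the question (Chunk, ChunkGrid, chunk_coord and chunk_at are illustrative names). std::map gives the rb-tree / O(log n) variant; swapping in std::unordered_map gives the hashed / O(1) variant:

#include <map>

constexpr int W = 10, H = 10;     // chunk dimensions from the question

struct Chunk {
    int x, y;                     // chunk coordinates
    char tiles[W][H];             // tile data
};

// Outer map keyed by chunk row (y), inner map keyed by chunk column (x).
using ChunkGrid = std::map<int, std::map<int, Chunk>>;

// Floor division so that negative tile coordinates land in the right chunk,
// e.g. tile -25 -> chunk -3.
int chunk_coord(int tile, int size) {
    return tile >= 0 ? tile / size : (tile - size + 1) / size;
}

// Finds or creates the chunk containing a given tile.
Chunk& chunk_at(ChunkGrid& grid, int tile_x, int tile_y) {
    int cx = chunk_coord(tile_x, W);
    int cy = chunk_coord(tile_y, H);
    Chunk& c = grid[cy][cx];      // allocates the row and the chunk on first access
    c.x = cx;
    c.y = cy;
    return c;
}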
Related
I have to implement a trie of codes of a given fixed length. Each code is a sequence of integers, and since some patterns are common, I decided to implement a trie in order to store all the codes.
I also need to iterate through the codes in lexicographic order, and I expect to work with millions (maybe billions) of codes.
This is why I considered implementing this particular trie as a dictionary where each key is the index of a given prefix.
Let's say key 0 has a list of its prefix children, and for each one I save the corresponding entry in the dictionary...
Example: If my first insertion is the code 231, then the dictionary would look like:
[0]->{(2,1)}
[1]->{(3,2)}
[2]->{(1,3)}
This way, if my second insertion were 243, the dictionary would be updated like this:
[0]->{(2,1)}
[1]->{(3,2),(4,3)} *Here each list is sorted using a flat_map
[2]->{(1,endMark)}
[3]->{(3,endMark)}
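A minimal sketch of that dictionary-of-prefix-nodes layout, assuming fixed-length codes; Trie, endMark and insert are illustrative names, and std::map stands in for the sorted flat_map:

#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

constexpr std::uint32_t endMark = UINT32_MAX;   // sentinel: "the code ends here"

struct Trie {
    // children[node] is that node's sorted list of (symbol, child index) pairs;
    // node 0 is the root. boost::container::flat_map would match the question
    // more closely than std::map.
    std::vector<std::map<int, std::uint32_t>> children;

    Trie() : children(1) {}

    void insert(const std::vector<int>& code) {
        std::uint32_t node = 0;
        for (std::size_t i = 0; i < code.size(); ++i) {
            const bool last = (i + 1 == code.size());
            auto it = children[node].find(code[i]);
            if (it != children[node].end()) {    // this prefix is already stored
                node = it->second;
                continue;
            }
            std::uint32_t next = last ? endMark : std::uint32_t(children.size());
            if (!last) children.emplace_back();  // allocate the new prefix entry
            children[node].emplace(code[i], next);
            node = next;
        }
    }
};

With this sketch, insert({2,3,1}) followed by insert({2,4,3}) reproduces the second state shown above, using the endMark convention for the last symbol.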
My problem is that I have been using a vector for this purpose, because having the whole dictionary in contiguous memory gives better performance while iterating over it.
Now, when I need to work with BIG instances of my problem, resizing the vector means I cannot handle millions of codes (memory consumption can reach 200 GB).
I have now tried Google's sparse hash instead of the vector, and my question is: do you have any suggestions? Any other alternative in mind? Is there any other way to work with integers as keys to improve performance?
I know I won't have any collisions because each key will be different from the rest.
Best regards,
Quentin
I am trying to implement a hashtable using linear probing.
Before inserting a (key, value) pair into the hashtable, I want to check if it's half full. If it is, I need to double the size of the underlying array.
Obviously, there are two ways to do that:
One is to create another array with the doubled size, rehash all entries in the old one and add them to the new array. Then, rebind the old array to the new one. This way is easy to implement but uses a lot of space.
The other one is to double the array and do the rehashing in-place. It seems that this way may lead to longer running time because rehashing may cause collisions with both newly hashed slots and old slots.
Which way should I use?
Your second solution only saves space during the resize process if there is in fact room to expand the existing hash table in-place - I think the chances of that being the case for a large hash table are quite slim, so I would just go for your first solution.
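A minimal sketch of that first approach for a linear-probing table (ProbingTable and grow are illustrative names, not from the question):

#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

template <typename K, typename V>
struct ProbingTable {
    std::vector<std::optional<std::pair<K, V>>> slots;
    std::size_t count = 0;

    ProbingTable() : slots(16) {}

    void insert(const K& key, const V& value) {
        if (2 * (count + 1) > slots.size()) grow();   // keep the table at most ~half full
        std::size_t i = std::hash<K>{}(key) % slots.size();
        while (slots[i] && slots[i]->first != key)    // linear probing
            i = (i + 1) % slots.size();
        if (!slots[i]) ++count;
        slots[i] = std::make_pair(key, value);
    }

    void grow() {
        // First approach: allocate a doubled array and rehash everything into it.
        std::vector<std::optional<std::pair<K, V>>> old(2 * slots.size());
        old.swap(slots);   // slots is now the empty doubled array; old holds the entries
        count = 0;
        for (const auto& s : old)
            if (s) insert(s->first, s->second);
    }
};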
At some point we need to increase the size of the hash table, and normally we just rehash, which means reconstructing the whole table.
Is there any better solution, so that when we increase the size we don't need to reconstruct the whole thing?
You could use http://en.wikipedia.org/wiki/Extendible_hashing, although AFAIK it is used mostly for on-disk databases.
There are also general methods for smoothing out some amortised costs. Starting points for this would be http://en.wikipedia.org/wiki/Static_and_dynamic_data_structures and http://en.wikipedia.org/wiki/Dynamization. One application of this to hash tables would be to always keep two tables, one of size N and one of size 2N or so. When the smaller overflows, start creating a table of size 4N, but don't populate it straight away - populate it incrementally while using the table of size 2N. By the time the table of size 2N is full, the table of size 4N should be ready. For the special case of hash tables, extendible hashing should be better.
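A hedged sketch of a simplified variant of that incremental idea, using std::unordered_map purely as a backing store to keep it short (IncrementalMap, migrate_some and the batch size are all illustrative): keep the current table and a bigger one side by side, and move a small batch of entries on every insert, so no single operation pays for a full rehash.

#include <cstddef>
#include <string>
#include <unordered_map>

class IncrementalMap {
    std::unordered_map<int, std::string> current, bigger;
    bool migrating = false;

    void migrate_some(std::size_t batch = 8) {
        while (migrating && batch-- > 0 && !current.empty()) {
            auto it = current.begin();
            bigger.insert(*it);               // move one entry to the bigger table
            current.erase(it);
        }
        if (migrating && current.empty()) {   // migration finished: swap tables
            current.swap(bigger);
            migrating = false;
        }
    }

public:
    void insert(int key, std::string value) {
        if (!migrating && current.load_factor() > 0.7f) {
            bigger.reserve(2 * current.bucket_count());  // pre-size the next table
            migrating = true;
        }
        if (migrating) {
            current.erase(key);               // avoid a stale copy in the old table
            bigger[key] = std::move(value);
        } else {
            current[key] = std::move(value);
        }
        migrate_some();
    }

    const std::string* find(int key) const {
        if (auto it = current.find(key); it != current.end()) return &it->second;
        if (auto it = bigger.find(key); it != bigger.end()) return &it->second;
        return nullptr;
    }
};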
Any time you resize and "re-hash", there's nothing that says you need to actually re-hash. In fact, all you actually need to do is re-mod (i.e. shift everything to its new position).
If you cache the hash (hehe, sounds like the start of a Dr. Seuss book), then you only need to compute it once. So store the hash along with the actual data, and that will save you from having to calculate it again in the future. However, I'm assuming you're not already doing this; you didn't exactly explain your current process.
#include <cstdint>
#include <string>

using data  = std::string;   // stand-in for whatever payload the table stores
using int32 = std::int32_t;

// Store these instead of the data directly. This assumes immutable data.
struct hashable_item
{
    data  dat;   // the actual payload
    int32 hash;  // computed once, at insertion time
};
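A hypothetical helper (not from the answer) showing how the cached hash gets reused when the table is resized: only the modulus changes, nothing is re-hashed.

#include <cstddef>

std::size_t new_slot(const hashable_item& item, std::size_t new_capacity)
{
    return static_cast<std::uint32_t>(item.hash) % new_capacity;
}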
I am writing some color management code, and I am dealing with LUTs (look up tables).
I can read the color profile LUT and convert my values... but how can I do the inverse operation? Is there perhaps a good algorithm to generate the 'inverse' of a LUT?
If your LUT is a given, the simplest method is to find the closest entry to any given color value. You can accelerate this computation by a variety of methods; for example, you can build a k-d tree out of your LUT entries and use it to eliminate most of the comparisons an exhaustive check would require.
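For illustration, the exhaustive baseline that a k-d tree would accelerate might look like this (RGB and nearest_entry are assumed names):

#include <array>
#include <cstddef>
#include <limits>
#include <vector>

using RGB = std::array<float, 3>;   // illustrative color type

// Exhaustive nearest-entry search over the LUT, by squared Euclidean distance.
std::size_t nearest_entry(const std::vector<RGB>& lut, const RGB& color) {
    std::size_t best = 0;
    float best_dist = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < lut.size(); ++i) {
        float dist = 0.0f;
        for (int k = 0; k < 3; ++k) {
            const float d = lut[i][k] - color[k];
            dist += d * d;
        }
        if (dist < best_dist) { best_dist = dist; best = i; }
    }
    return best;
}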
However, this will tend to result in a "posterized" image, since smooth areas in your image will shift abruptly from one entry to the next. You can avoid this by taking your pixels in (quasi-)random order, picking the best fit from your LUT, and pushing the difference between the pixel value and the chosen entry back onto the nearby pixels which haven't already been chosen.
There are a variety of ways to do this last step, but they all result in a dithering effect that generally makes better use (for imaging purposes) of the available LUT entries than the simple per-pixel operation can.
Yes, you can usually invert a lookup table efficiently (linear time), assuming that the function is a bijection. If your lookup table maps two different keys to the same value, then there is no direct way to invert the table because you would end up needing to have a value that maps to two different keys. If you're okay with this that's fine, though it may call into question why you're trying to build the reverse map.
If you know that every value is unique, you can build an inverse lookup table as follows. First, create a data structure to hold the mapping from values to keys - perhaps a hash table, or a balanced binary tree, or a raw array if the values are small integers. Next, iterate over each key/value pair from the lookup table, then insert the mapping value → key into the new lookup table. This can be done in linear time plus the time required to insert the values into the new container.
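A short sketch of that procedure, assuming a LUT stored as a vector indexed by key (invert_lut and the integer types are illustrative):

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Builds the reverse map value -> key in linear time, assuming the LUT is a
// bijection (every value unique). Returns an empty map if a duplicate value
// is found.
std::unordered_map<std::uint32_t, std::uint32_t>
invert_lut(const std::vector<std::uint32_t>& lut) {
    std::unordered_map<std::uint32_t, std::uint32_t> inverse;
    inverse.reserve(lut.size());
    for (std::size_t key = 0; key < lut.size(); ++key) {
        const bool inserted =
            inverse.emplace(lut[key], static_cast<std::uint32_t>(key)).second;
        if (!inserted) return {};   // not a bijection: no direct inverse exists
    }
    return inverse;
}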
I have x (millions of) positive integers, whose values can be as large as allowed (+2,147,483,647). Assuming they are unique, what is the best way to store them for a lookup-intensive program?
So far I have thought of using a binary AVL tree or a hash table, where the integer is the key to the mapped data (a name). However, I am not too sure whether I can handle such large keys, and in such large quantities, with a hash table (wouldn't that create a >0.8 load factor, in addition to being prone to collisions?).
Could I get some advice on which data structure might be suitable for my situation?
The choice of structure depends heavily on how much memory you have available. I'm assuming based on the description that you need lookup but not to loop over them, find nearest, or other similar operations.
Best is probably a bucketed hash table. By placing hash collisions into buckets and keeping separate arrays in the bucket for keys and values, you can both reduce the size of the table proper and take advantage of CPU cache speedup when searching a bucket. Linear search within a bucket may even end up faster than binary search!
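A sketch of that bucketed layout, with parallel key and value arrays per bucket so the key scan stays cache-friendly (Bucket and BucketedTable are illustrative names):

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Bucket {
    std::vector<std::int32_t> keys;    // contiguous keys: fast linear scan
    std::vector<std::string>  values;  // values in a parallel array
};

struct BucketedTable {
    std::vector<Bucket> buckets;

    explicit BucketedTable(std::size_t n_buckets = 1u << 20)   // ~1M buckets
        : buckets(n_buckets) {}

    void insert(std::int32_t key, std::string value) {
        Bucket& b = buckets[static_cast<std::uint32_t>(key) % buckets.size()];
        b.keys.push_back(key);
        b.values.push_back(std::move(value));
    }

    const std::string* find(std::int32_t key) const {
        const Bucket& b = buckets[static_cast<std::uint32_t>(key) % buckets.size()];
        for (std::size_t i = 0; i < b.keys.size(); ++i)   // linear scan of the bucket
            if (b.keys[i] == key) return &b.values[i];
        return nullptr;
    }
};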
AVL trees are nice for data sets that are read-intensive but not read-only AND require ordered enumeration, find-nearest and similar operations, but they're an annoying amount of work to implement correctly. You may get better performance with a B-tree because of CPU cache behavior, though, especially a cache-oblivious B-tree algorithm.
Have you looked into B-trees? The efficiency runs between log_m(n) and log_(m/2)(n) so if you choose m to be around 8-10 or so you should be able to keep your search depth to below 10.
A bit vector, with the bit at each index set if that number is present. You can tweak it to store the number of occurrences of each number. There is a nice column about bit vectors in Bentley's Programming Pearls.
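As a rough sketch, one bit per possible positive 32-bit value costs about 256 MB no matter how many numbers are stored (PresenceBits, add and contains are illustrative names):

#include <cstddef>
#include <cstdint>
#include <vector>

struct PresenceBits {
    // 2^31 bits, bit-packed by the standard library: roughly 256 MB.
    std::vector<bool> bits = std::vector<bool>(std::size_t(1) << 31);

    void add(std::int32_t n)            { bits[static_cast<std::size_t>(n)] = true; }
    bool contains(std::int32_t n) const { return bits[static_cast<std::size_t>(n)]; }
};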
If memory isn't an issue, a map is probably your best bet. A hash-based map gives O(1) average lookups, meaning that as you scale up the number of items, the time it takes to find a value stays about the same.
A map where the key is the int, and the value is the name.
Do try hash tables first. There are some variants that can tolerate being very dense without significant slowdown (like Brent's variation).
If you only need to store the 32-bit integers and not any associated record, use a set and not a map, like hash_set in most C++ libraries. It would use only 4-byte records plus some constant overhead and a little slack to avoid being 100% full. In the worst case, to handle 'millions' of numbers you'd need a few tens of megabytes. Big, but nothing unmanageable.
If you need it to be much tighter, just store them sorted in a plain array and use binary search to fetch them. It will be O(log n) instead of O(1), but for 'millions' of records it's still just twentysomething steps to get any one of them. In C you have bsearch(), which is as fast as it can get.
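For illustration, the same approach with standard C++ algorithms instead of C's bsearch():

#include <algorithm>
#include <cstdint>
#include <vector>

// keys must be sorted once up front: std::sort(keys.begin(), keys.end());
bool contains(const std::vector<std::int32_t>& keys, std::int32_t key) {
    return std::binary_search(keys.begin(), keys.end(), key);   // O(log n)
}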
Edit: I just saw in your question that you talk about some 'mapped data (a name)'. Are those names unique? Do they also have to be in memory? If so, they would definitely dominate the memory requirements. Even so, if the names are typical English words, most would be 10 bytes or less, keeping the total size in the tens of megabytes; maybe up to a hundred megs, still very manageable.