Is it wise to leave holes in a table? - performance

Caution: May be hard to understand
So let's say I have a table that needs to hold indexes corresponding to entries in a different table holding objects. For the sake of this question, I'm going to refer to these indexes as "idxs". I would like to be able both to remove a specific idx quickly and to loop over all the idxs quickly.
I want this to go as fast as possible, and I have a few options to choose from:
Note: I would say there may be anywhere from 500 to 3000 objects at maximum, which means up to that many idxs referring to them. However, there probably won't be more than 200 idxs stored in my table at a time.
I can store each idx by pushing it onto the end of the table, which makes it easy to loop over them, but removal becomes a problem: I have to scan the table to find the right one before I can remove it.
I can make the table into a sort of set, as described here (Search for an item in a Lua list), which makes removing an idx extremely fast but looping much slower, since I may have to skip over all the empty gaps in the table, and those could span thousands of slots. Also, I'm actually going to have lots of these tables holding idxs, so I don't think having that many tables of that length is a great idea.
There is possibly a third solution, or maybe more, but I'm not sure. What would be the best thing to do in this situation? Or is this a sign that I should redesign my program entirely?
As a second note, removing idxs probably happens more often each frame than looping over them does.
I can give more context if needed.
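To make the trade-off concrete, here is a rough sketch of one shape a "third solution" could take: keep the idxs packed in an array for gap-free iteration, plus a reverse map from idx to array slot so removal is O(1) by swapping the last element into the hole. The sketch is in Java for illustration (the same shape maps directly onto a Lua array-part table plus a lookup table); the IdxBag name and its methods are invented.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical "IdxBag": a dense array for fast iteration plus a
// reverse map (idx -> slot) so removal is O(1) via swap-with-last.
class IdxBag {
    private final ArrayList<Integer> idxs = new ArrayList<>();
    private final Map<Integer, Integer> position = new HashMap<>();

    void add(int idx) {
        position.put(idx, idxs.size());
        idxs.add(idx);
    }

    void remove(int idx) {
        Integer slot = position.remove(idx);
        if (slot == null) return;                 // idx wasn't stored
        int last = idxs.remove(idxs.size() - 1);  // pop the tail element
        if (slot < idxs.size()) {                 // unless we just popped idx itself,
            idxs.set(slot, last);                 // move the old tail into the hole
            position.put(last, slot);
        }
    }

    Iterable<Integer> all() { return idxs; }      // iteration stays dense, no gaps
}

Removal this way doesn't preserve insertion order, but if the idxs are just a membership set, that may be acceptable.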

Can two items in a hashmap be in different locations but have the same hashcode?
I'm new to hashing, and I've recently learned about hashmaps. I was wondering: can two objects with the same hashcode possibly go to different locations in a hashmap?
I'm not completely sure and would appreciate any help.
As @Dai pointed out in the comments, this will depend on what kind of hash table you're using. (Turns out, there are a bunch of different ways to make a hash table, and no one data structure is "the" way that hash tables work!)
One of the more common hash table designs uses a strategy called closed addressing (also known as separate chaining). In closed addressing, every item is mapped to a slot based on its hash code and stored with all other items that also end up in that slot. Lookups are then done by finding which bucket to look in, then inspecting all the items in that bucket. In that case, any two items with the same hash code will end up in the same bucket. (They can't literally occupy the same spot within that bucket, though.)
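To make that concrete, here's a minimal chained-table sketch (in Java; the names are invented, and the key doubles as its own hash code for simplicity):

import java.util.LinkedList;
import java.util.List;

// Illustrative closed-addressing (chained) set: every key whose hash
// lands on a bucket is stored in that bucket's list, alongside any
// other keys that collided into it.
class ChainedSet {
    private final List<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedSet(int size) {
        buckets = new List[size];
        for (int i = 0; i < size; i++) buckets[i] = new LinkedList<>();
    }

    void insert(int key) {
        buckets[Math.floorMod(key, buckets.length)].add(key); // identity hash for simplicity
    }

    boolean contains(int key) {
        // Find the one bucket the hash points at, then scan everything in it.
        return buckets[Math.floorMod(key, buckets.length)].contains(key);
    }
}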
Another strategy for building hash tables uses an approach called open addressing. This is a family of different methods that are all based on the following idea. We require that each slot in the table store at most one element. As before, to do an insertion, we use the element's hash code to figure out which slot to put it in. If the slot is empty, great! We put the element there. If that slot is full, we can't put the item there. Instead, using some predictable strategy, we start looking at other slots until we find a free one, then put the item there. (The simplest way of doing this, linear probing, works by trying the next slot after the desired slot, then the next one, etc., wrapping around if need be.) In this system, since we can't store multiple items in the same spot, no, two elements with the same hash code don't have to (and in fact, can't!) occupy the same spot.
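Here's a similarly minimal linear-probing sketch (same caveats: invented names, identity hash):

// Illustrative open-addressing set with linear probing: at most one
// key per slot; on a collision, try the next slot (wrapping around)
// until a free one turns up.
class ProbedSet {
    private final Integer[] slots;

    ProbedSet(int size) { slots = new Integer[size]; }

    boolean insert(int key) {
        int start = Math.floorMod(key, slots.length); // identity hash for simplicity
        for (int step = 0; step < slots.length; step++) {
            int i = (start + step) % slots.length;
            if (slots[i] == null || slots[i] == key) {
                slots[i] = key;
                return true;
            }
        }
        return false; // every slot full: insertion is impossible
    }
}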
A more recent hashing strategy that's becoming more popular is cuckoo hashing. In cuckoo hashing, we maintain some small number of separate hash tables (typically, two or three), where each slot can only hold one item. To insert an element, we try placing it in the first table at a spot determined by its hash code. If that spot is free, great! We put the item there. If not, we kick out the item there and try putting that item in the next table. This process repeats until eventually everything comes to rest or we get caught in a loop. Like open addressing, this system prevents multiple items from being stored in the same slot, so two elements with the same hash code might go to different places. (There are variations on cuckoo hashing in which each table slot can store a fixed but small number of items, in which case you could have two items with the same hash code in the same spot. But it's not guaranteed.)
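And a minimal two-table cuckoo insertion sketch, with a fixed bounce limit standing in for the real cycle detection and rehashing an actual implementation would do:

// Illustrative two-table cuckoo set: each slot holds one key; a
// collision evicts the occupant into the other table, bouncing back
// and forth until everything settles.
class CuckooSet {
    private final Integer[] t1, t2;

    CuckooSet(int size) { t1 = new Integer[size]; t2 = new Integer[size]; }

    private int h1(int k) { return Math.floorMod(k, t1.length); }
    private int h2(int k) { return Math.floorMod(k * 31 + 7, t2.length); } // a second hash

    boolean insert(int key) {
        for (int bounce = 0; bounce < 32; bounce++) {
            Integer evicted = t1[h1(key)];
            t1[h1(key)] = key;
            if (evicted == null) return true;  // settled in table 1
            key = evicted;                     // kicked-out key goes to table 2
            evicted = t2[h2(key)];
            t2[h2(key)] = key;
            if (evicted == null) return true;  // settled in table 2
            key = evicted;                     // kicked-out key goes back to table 1
        }
        return false; // likely a cycle: a real table would rehash and retry
    }
}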
There are some other hashing schemes I didn't describe here. FKS perfect hashing works by using two layers of hash tables, along the lines of closed addressing, but periodically rebuilds the whole table to ensure that no one bucket is too overloaded. Extendible hashing uses a trie-like structure to grow overflowing buckets once they become too full. Hopscotch hashing is a hybrid between linear probing and chained hashing and plays well with concurrency. But hopefully this gives you a sense of how the type of hash table you use influences the answer to your question!

Fast algorithm for approximate lookup on multiple keys

I have formulated a solution to a problem where I am storing parameters in a set of tables, and I want to be able to look up the parameters based on multiple criteria.
For example, if criteria 1 and criteria 2 can each be either A or B, then I'd have four potential parameters, one for each combination A&A, A&B, B&A and B&B. For these sorts of criteria I could concatenate the fields or something similar and create a unique key to look up each value quickly.
Unfortunately not all of my criteria are like this. Some of the criteria are numerical and I only care about whether or not a result sits above or below a boundary. That also wouldn't be a problem on its own - I could maybe use a binary search or something relatively quick to find the nearest key above or below my value.
My problem is I need to include a number of each in the same table. In other words, I could have three criteria - two with A/B entries, and one with less-than-x/greater-than-x type entries, where x is in no way fixed. So in this example I would have a table with 8 entries. I can't just do a binary search for the boundary because the closest boundary won't necessarily be applicable due to the other criteria. For example, if the first two criteria are A&B, then the closest boundary might be 100, but if the first two criteria are A&A, the closest boundary might be 50. If I want to look up A, A, 101, then I want it to recognise that 50 is the closest boundary that applies - not 100.
I have a procedure to do the lookup but it gets very slow as the tables get bigger. It basically goes through each criterion, checks whether a match is still possible, and if so looks at the next criterion; if not, it moves on to check the next entry in the table. So in other words, my procedure requires cycling through the table entries one by one and checking for a match. I have tried to optimise that by ensuring the tables that are input to the procedure are as small as possible and by making sure it looks at the criteria that are least likely to match first (so that it rejects each entry as quickly as possible), but it is still very slow.
The biggest tables are maybe 200 rows with about 10 criteria to check, but many are much smaller (maybe 10x5). The issue is that I need to call the procedure many times during my application, so algorithms with some initial overhead don't necessarily make things better. I do have some scope to change the format of the tables before runtime but I would like to keep away from that as much as possible (while recognising it may be the only way forward).
I've done quite a bit of research but I haven't had any luck. Does anyone know of any algorithms that have been designed to tackle this kind of problem? I was really hoping that there would be some clever hash function or something that means I won't have to cycle through the tables, but from my limited knowledge something like that would struggle here. I feel confident that I understand the problem well enough to gradually optimise the solution I have at the moment, but I want to be sure I've not missed a much better solution.
Apologies for the very long and abstract description of the problem - hopefully it's clear what I'm trying to do. I'll amend my question if it's unclear.
Thanks for any help.
This is basically what a query optimizer does in SQL land. There are fast, free, in-memory databases for exactly this purpose. Check out SQLite: https://www.sqlite.org/inmemorydb.html.
It sounds like you are doing what is called a 'full table scan' for each query, which is the last resort for a query optimizer.
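As a rough sketch of what that could look like, assuming the xerial sqlite-jdbc driver is on the classpath (the table, columns, and values here are invented from the A/A/101 example in the question):

import java.sql.*;

// In-memory SQLite with an index on the criteria columns, so the
// lookup doesn't have to scan every row.
public class InMemoryLookup {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection("jdbc:sqlite::memory:")) {
            Statement s = c.createStatement();
            s.execute("CREATE TABLE params (c1 TEXT, c2 TEXT, bound REAL, val REAL)");
            s.execute("CREATE INDEX params_idx ON params (c1, c2, bound)");
            s.execute("INSERT INTO params VALUES ('A','A',50,1.5), ('A','B',100,2.5)");

            // The A, A, 101 lookup should match the A&A boundary of 50,
            // not the A&B boundary of 100.
            PreparedStatement q = c.prepareStatement(
                "SELECT val FROM params WHERE c1 = ? AND c2 = ? AND bound <= ? " +
                "ORDER BY bound DESC LIMIT 1");
            q.setString(1, "A");
            q.setString(2, "A");
            q.setDouble(3, 101);
            try (ResultSet r = q.executeQuery()) {
                if (r.next()) System.out.println(r.getDouble("val")); // prints 1.5
            }
        }
    }
}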
As I understand it, you want to select entries by criteria like
A & not B & x1 >= lower_x1 & x1 < upper_x1 & x2 >= lower_x2 & x2 < upper_x2 & ...
The easiest way is to keep the entries sorted by each possible xi (i = 1, 2, ...) in separate sets, and to have a separate 'word' (sub-table) for each combination of the Boolean criteria A, B, ...
The search then works as follows:
Select the proper 'word' by the combination of Boolean criteria
For each i, find the population of the lower_xi..upper_xi range in the corresponding set (this operation is O(log N))
Select the i for which that population is lowest
While iterating over the entries in the lower_xi..upper_xi range, filter the results by checking the other upper/lower bound criteria (for all xj where j != i)
Note that this is a general solution (a sketch follows below). Of course, if you know some relation between your bounds, you may use a list sorted by the respective combination(s) of item values.
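A rough sketch of those steps, with invented names (Entry, Word, query); note that TreeMap's subMap(...).size() is linear in the size of the submap, so a true O(log N) population count would need an order-statistic tree instead:

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

// One "word" per Boolean combination; inside it, one sorted index per
// numeric criterion i, mapping x_i -> entries with that value.
class Entry {
    double[] x;     // the entry's numeric criteria values
    Object payload; // the parameters to return
}

class Word {
    List<NavigableMap<Double, List<Entry>>> indexes = new ArrayList<>();

    List<Entry> query(double[] lower, double[] upper) {
        // 1. Find the criterion whose [lower_i, upper_i) range is sparsest.
        int best = 0, bestCount = Integer.MAX_VALUE;
        for (int i = 0; i < indexes.size(); i++) {
            int count = indexes.get(i).subMap(lower[i], true, upper[i], false).size();
            if (count < bestCount) { bestCount = count; best = i; }
        }
        // 2. Walk only that narrow range, filtering on every other bound.
        List<Entry> out = new ArrayList<>();
        for (List<Entry> bucket :
                indexes.get(best).subMap(lower[best], true, upper[best], false).values()) {
            for (Entry e : bucket) {
                boolean ok = true;
                for (int j = 0; j < lower.length && ok; j++)
                    ok = e.x[j] >= lower[j] && e.x[j] < upper[j];
                if (ok) out.add(e);
            }
        }
        return out;
    }
}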

brainf_ck not operation in a list

I'm having a problem with the not operation (and nearly all operations) in a list. What I mean by a list is 0 i1 i2 i3 ... i(n-1) i(n) 0, with n unknown.
In my program I'm at an unknown index in that list, and I need to check whether it is 0.
For the not algorithm you need a temporary cell, but you can only get to that cell with a [<] or a [>], and then you will lose the value in the list.
Reminder: the a == 0 algorithm goes like this (in the usual brainfuck-algorithm notation, where a cell name means "move the pointer to that cell"):
t0[-]+
a[t0-]
t0[
<code>
]
The only thing I could come up with is leaving a 1 between each index, but that seems extremely inelegant.
So my question is: is there a better way to do this?
Actually, the 1-between-each-element approach is really one of the more efficient ways to do it. You simply walk back and forth until you meet a zero, so you know which end of the sequence you're at, and also how many elements there are. The markers are also really easy to clear up after each operation.
There are ways to use only one cell per element, but they would require moving all elements to the left of the one you want one position left, and then moving them all back, for each operation. In some cases this might be faster, if you only store small values in each element and you have a lot of elements.
It depends on what you want to achieve. Personally I think the first option, leaving a trail of 1s and clearing them afterwards, is the better one, even though it requires twice the space, as it is usually significantly faster in the general case.

Is there a specific scenario of a hash table that isn't full yet an insertion can't occur?

What I mean to ask is: for a hash table following the standard practice of a prime-number size, is it possible to have some scenario (of inserted keys) where no further insertion of a given element is possible, even though there are some empty slots? What kind of hash function would produce that?
So, most hash tables allow for collisions ("hash collisions" is the phrase you should google to understand this better, by the way). Collisions are handled by having a secondary data structure, like a list, to store all of the values inserted at keys with the same hash.
Because these data structures can generally store arbitrarily many elements, you will always be able to insert into the hash table, but the performance will get worse and worse, approaching the performance of the backing data structure.
If you do not have a backing data structure, then you can be unable to insert as soon as two things get added to the same position. Since a good hash function distributes things evenly and effectively randomly, this would happen pretty quickly (see "The Birthday Problem").
There are failure-to-insert scenarios for some but not all hash table implementations.
For example, closed hashing (aka open addressing) implementations use some logic to create a sequence of buckets in which they'll "probe" for values not found at the hashed-to bucket due to collisions. In the real world, the sequence-creation is sometimes pretty basic, for example:
the programmer might have hard-coded N prime numbers, thinking the odds of trying each of those in turn and still not finding an empty bucket are low (but a malicious user who knows the hash table design may be able to calculate values that make the table fail; or the table may simply be so full that the odds are no longer good; or, while emptier, a statistical freak event may occur)
the programmer might have done something like pick a prime number they liked, say 13903, to add to the last-probed bucket index each time until a free one is found, but if the table size happens to be 13903 too, it'll keep checking the same bucket (a small demonstration follows below).
Still, there are probing approaches, such as linear probing, that are guaranteed to try all buckets (unless the implementation goes out of its way to put a limit on retries). Linear probing has some other "issues" though, and won't always be the best choice.
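To see why the 13903 case gets stuck, a tiny demonstration (the class and method names are invented):

// Counting distinct buckets visited shows why a fixed stride equal to
// the table size never leaves its starting bucket, while a stride of 1
// (linear probing) covers the whole table.
public class ProbeDemo {
    static int distinctBuckets(int size, int start, int stride) {
        java.util.Set<Integer> seen = new java.util.HashSet<>();
        int i = start;
        for (int k = 0; k < size; k++) {
            seen.add(i);
            i = (i + stride) % size;
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctBuckets(13903, 5, 13903)); // 1: stuck on one bucket
        System.out.println(distinctBuckets(13903, 5, 1));     // 13903: all buckets reachable
    }
}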
If a hash table is implemented using open addressing instead of separate chaining, then it is a good idea to leave at least 1 slot empty to simplify the algorithm.
In open addressing when we are trying to find an element, we first compute the hash index i, then check the table at indexes {i, i + 1, i + 2, ... N - 1, (wrapping around) 0, 1, 2, ...}, until we either find the element we want or hit an empty slot. You can see that in this algorithm, if no slot is empty but the element can't be found, then the search would loop forever.
However, I should emphasize that enforcing an empty slot merely simplifies the search algorithm. Alternatively, the search algorithm can remember the starting index i, and halt the search if the entire table has been scanned and it lands back at index i.
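A minimal sketch of that search loop, with both the empty-slot termination and the remembered starting index (invented names, keys hashed by their own value):

// Probe until we find the key or hit an empty slot; remembering the
// starting index stops a completely full table from looping forever.
class OpenAddressedLookup {
    static Integer find(Integer[] slots, int key) {
        int start = Math.floorMod(key, slots.length);
        int i = start;
        do {
            if (slots[i] == null) return null;    // empty slot: key can't be further along
            if (slots[i] == key) return slots[i]; // found it
            i = (i + 1) % slots.length;           // wrap around past the end
        } while (i != start);                     // came full circle: table full, key absent
        return null;
    }
}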

optimal algorithm for adding chosen table rows to the database

I am trying to implement a save method in a backing Java bean which will take the table rows that are selected and save them in the database. However, let's say the user changes his choices a little (changes 1 out of his 5 choices). I am wondering whether the algorithm I apply matters for efficiency in the long term or not.
here it goes :
every time the user clicks the save button, delete all of the user's previous choices and insert all the current choices into the database
once the button is clicked, see which rows the user de-selected, delete those rows from the database, and add the newly selected ones
Is option 2 better than option 1, or does it not really matter when the number of choices will not exceed 15?
Thanks
I would definitely go for option 2: try to figure out the minimum number of operations you need to perform.
It is, however, fairly normal to fall back to option 1 in times of deadlines etc. since it is a bit easier to implement.
It shouldn't, however, be that much harder to figure out what the changes are, since it doesn't seem to me that you're changing the rows themselves. Either you're deleting rows that had their checkmark cleared, or you're inserting ones that had their checkmark set.
Simply store a list of the primary key values of whatever is in the database, then compare against that list as you iterate through the new selection when the user wants to persist the changes.
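A minimal sketch of that comparison (the names are invented, with primary keys as Longs):

import java.util.HashSet;
import java.util.Set;

// Compare the keys already persisted with the keys now selected and
// only touch the rows that changed.
class SelectionDiff {
    static void sync(Set<Long> inDatabase, Set<Long> nowSelected) {
        Set<Long> toDelete = new HashSet<>(inDatabase);
        toDelete.removeAll(nowSelected); // was saved, no longer checked

        Set<Long> toInsert = new HashSet<>(nowSelected);
        toInsert.removeAll(inDatabase);  // newly checked, not yet saved

        // Issue one DELETE per key in toDelete and one INSERT per key in toInsert.
    }
}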
A minimal-work solution here would also mean you would be a bit more future-proof in terms of refactoring, changes, or additions. For instance, what if, in the future, there is data attached to any of those rows? You would need to keep that as well. Generally I'm a bit opposed to writing code just for the sake of "what if", but here I feel it's more like "why wouldn't you ..." than that.
So my advice is go for option 2. Not much more work.
