I'm looking for a scheme for assigning keys to rows in a table that would allow rows to be moved around and assigned new locations in the table without having to renumber the entire table.
Something like having keys 1, 2, 3, 4, then moving row "2" between 3 and 4 and renaming it "3.5" (so you end up with 1, 3, 3.5, 4). But the scheme needs to be "infinitely" extensible (permitting at least a few thousand "random" row moves before it would normally be necessary to "normalize" the keys, and in the worst (most pathological) case allowing 25-50 such moves).
And the keys produced should be easily sorted; ideally I'd like them to be "naturally" ordered for a database (assume SQLite) query.
Any ideas?
This problem reminds me of the line numbering problem people ran into when writing code in BASIC. What most people did in that situation was take an educated guess at how many lines might be inserted between two lines, and use that guess as the spacing between them. So if you think you might have 2000 inserts between two elements, you might give element1 a key of 2000 and element2 a key of 4000. Then when you want to put an element between element1 and element2, you either naively split the difference (3000) or, if you have some intuition about how many elements will go on each side of element3, you weight it a bit (e.g. 3500 instead of 3000).
Another alternative (it's really just the same thing, but with a different numbering system) is to use floating point numbers, which I believe you alluded to. Between 1 and 2 would be 1.5. Between 1.5 and 2 would be 1.75. Between 1.5 and 1.75 would be 1.625, etc.
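To make the two numbering schemes concrete, here is a minimal Python sketch. The function names are made up for illustration. The first helper splits an integer gap and reports when the gap is used up; the second counts how many times you can keep inserting right next to the same neighbour with float keys before the midpoint stops being distinct (the pathological case from the question).

```python
def midpoint_key(lower, upper):
    """Return an integer key strictly between lower and upper, or None if the gap is used up."""
    middle = (lower + upper) // 2
    return middle if lower < middle < upper else None


def float_splits_before_exhaustion(lower=1.0, upper=2.0):
    """Count how often a key can be inserted just above `lower` using float midpoints."""
    splits = 0
    while True:
        middle = (lower + upper) / 2
        if not (lower < middle < upper):
            return splits  # precision exhausted: time to renumber the keys
        upper = middle     # the next insert goes between `lower` and the key just created
        splits += 1


print(midpoint_key(2000, 4000))          # 3000
print(midpoint_key(3000, 3001))          # None
print(float_splits_before_exhaustion())  # roughly 50 with 64-bit floats
```

With 64-bit floats the worst case lands right around the 25-50 moves asked for, while moves scattered across the table last far longer before a renumbering is needed.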
I would recommend against a key that is a string. It is better to stick with numeric keys, and on top of that it is probably better to have integer type keys rather than floating point type keys if you can help it.
Conceptually, you could treat your table like a linked list. Create a table with a unique ID, the key and its next node, and whatever other data you want. Simply insert items sequentially; when you need to put a new item in between, swap the key values and the associated parent nodes. The key values won't remain consistent, but that is what the additional unique ID is for, and this works fine for ordering by the key as well.
Really, since you have order already specified by the key, you don't even need the 'next node'. Your scheme as described above should be fine as long as you rename the keys of the other nodes in addition to the one you moved - i.e., 2 and 3 get their key values swapped.
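A small sketch of the key swap using Python's built-in sqlite3 module, since the question assumes SQLite. The table name `items` and the columns `id`, `sort_key` and `label` are invented for this example; the only point is that the stable `id` never changes while the sort keys are exchanged.

```python
import sqlite3


def swap_keys(conn, id_a, id_b):
    """Exchange the sort keys of two rows identified by their stable ids."""
    (key_a,) = conn.execute("SELECT sort_key FROM items WHERE id = ?", (id_a,)).fetchone()
    (key_b,) = conn.execute("SELECT sort_key FROM items WHERE id = ?", (id_b,)).fetchone()
    with conn:  # both updates are committed together, or rolled back together
        conn.execute("UPDATE items SET sort_key = ? WHERE id = ?", (key_b, id_a))
        conn.execute("UPDATE items SET sort_key = ? WHERE id = ?", (key_a, id_b))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, sort_key INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO items (sort_key, label) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
)
conn.commit()

swap_keys(conn, 2, 3)  # move row "b" so that it sits between "c" and "d"
ordered = [row[0] for row in conn.execute("SELECT label FROM items ORDER BY sort_key")]
print(ordered)  # ['a', 'c', 'b', 'd']
```

Moving a row more than one position this way takes one swap per position crossed, so it suits short moves better than long ones.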
Related
Can two items in a hashmap be in different locations but have the same hashcode?
I'm new to hashing, and I've recently learned about hashmaps. I was wondering whether two objects with the same hashcode can possibly go to different locations in a hashmap?
I'm not completely sure and would appreciate any help
As @Dai pointed out in the comments, this will depend on what kind of hash table you're using. (Turns out, there's a bunch of different ways to make a hash table, and no one data structure is "the" way that hash tables work!)
One of the more common hash tables uses a strategy called closed addressing. In closed addressing, every item is mapped to a slot based on its hash code and stored with all other items that also end up in that slot. Lookups are then done by finding which bucket to look in, then inspecting all the items in that bucket. In that case, any two items with the same hash code will end up in the same bucket. (They can't literally occupy the same spot within that bucket, though.)
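Here is a rough Python sketch of closed addressing (chaining), just to show the behaviour described. The class name and the deliberately bad hash function (`len`) are assumptions chosen so that two different keys are guaranteed to share a hash code.

```python
class ChainedTable:
    """Toy hash table where each slot holds a list (bucket) of keys."""

    def __init__(self, size, hash_fn):
        self.buckets = [[] for _ in range(size)]
        self.hash_fn = hash_fn

    def _index(self, key):
        return self.hash_fn(key) % len(self.buckets)

    def insert(self, key):
        bucket = self.buckets[self._index(key)]
        if key not in bucket:
            bucket.append(key)

    def bucket_of(self, key):
        """Return the bucket index holding key, or None if the key is absent."""
        index = self._index(key)
        return index if key in self.buckets[index] else None


table = ChainedTable(7, len)  # len("ab") == len("cd") == 2, so the hash codes collide
table.insert("ab")
table.insert("cd")
print(table.bucket_of("ab"), table.bucket_of("cd"))  # 2 2
print(table.buckets[2])                              # ['ab', 'cd']
```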
Another strategy for building hash tables uses an approach called open addressing. This is a family of different methods that are all based on the following idea. We require that each slot in the table store at most one element. As before, to do an insertion, we use the element's hash code to figure out which slot to put it in. If the slot is empty, great! We put the element there. If that slot is full, we can't put the item there. Instead, using some predictable strategy, we start looking at other slots until we find a free one, then put the item there. (The simplest way of doing this, linear probing, works by trying the next slot after the desired slot, then the next one, etc., wrapping around if need be.) In this system, since we can't store multiple items in the same spot, no, two elements with the same hash code don't have to (and in fact, can't!) occupy the same spot.
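For comparison, a minimal linear probing sketch in Python. Again, the class name and the `len` hash function are illustrative assumptions rather than any particular library's design. The two colliding keys now land in different slots.

```python
class LinearProbingTable:
    """Toy open-addressing table: one key per slot, linear probing on collision."""

    def __init__(self, size, hash_fn):
        self.slots = [None] * size
        self.hash_fn = hash_fn

    def insert(self, key):
        """Store key and return the slot index it ended up in."""
        size = len(self.slots)
        start = self.hash_fn(key) % size
        for step in range(size):
            index = (start + step) % size  # wrap around at the end of the table
            if self.slots[index] is None or self.slots[index] == key:
                self.slots[index] = key
                return index
        raise RuntimeError("table is full")


table = LinearProbingTable(7, len)
print(table.insert("ab"))  # 2, the slot its hash code asks for
print(table.insert("cd"))  # 3, pushed to the next free slot
```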
A more recent hashing strategy that's becoming more popular is cuckoo hashing. In cuckoo hashing, we maintain some small number of separate hash tables (typically, two or three), where each slot can only hold one item. To insert an element, we try placing it in the first table at a spot determined by its hash code. If that spot is free, great! We put the item there. If not, we kick out the item there and try putting that item in the next table. This process repeats until eventually everything comes to rest or we get caught in a loop. Like open addressing, this system prevents multiple items from being stored in the same slot, so two elements with the same hash code might go to different places. (There are variations on cuckoo hashing in which each table slot can store a fixed but small number of items, in which case you could have two items with the same hash code in the same spot. But it's not guaranteed.)
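A simplified cuckoo hashing sketch in Python, assuming two tables and two made-up hash functions (string length and the code of the first character). Real implementations rebuild the tables with new hash functions when an insert loops; this sketch just raises an error after a fixed number of kicks.

```python
class CuckooTable:
    """Toy cuckoo hash table: one table per hash function, one key per slot."""

    def __init__(self, size, hash_fns, max_kicks=32):
        self.size = size
        self.hash_fns = hash_fns
        self.tables = [[None] * size for _ in hash_fns]
        self.max_kicks = max_kicks

    def find(self, key):
        """Return (table number, slot) for key, or None if it is absent."""
        for number, hash_fn in enumerate(self.hash_fns):
            slot = hash_fn(key) % self.size
            if self.tables[number][slot] == key:
                return (number, slot)
        return None

    def insert(self, key):
        if self.find(key) is not None:
            return
        number = 0
        for _ in range(self.max_kicks):
            slot = self.hash_fns[number](key) % self.size
            evicted = self.tables[number][slot]
            self.tables[number][slot] = key
            if evicted is None:
                return
            key = evicted  # the evicted key must now find a home in the next table
            number = (number + 1) % len(self.tables)
        raise RuntimeError(f"gave up while placing {key!r}; a real table would rehash")


table = CuckooTable(7, [len, lambda s: ord(s[0])])
table.insert("ab")
table.insert("cd")        # same first hash code as "ab", so "ab" gets kicked out
print(table.find("cd"))   # (0, 2)
print(table.find("ab"))   # (1, 6)
```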
There are some other hashing schemes I didn't describe here. FKS perfect hashing works by using two layers of hash tables, along the lines of closed addressing, but periodically rebuilds the whole table to ensure that no one bucket is too overloaded. Extendible hashing uses a trie-like structure to grow overflowing buckets once they become too full. Hopscotch hashing is a hybrid between linear probing and chained hashing and plays well with concurrency. But hopefully this gives you a sense of how the type of hash table you use influences the answer to your question!
What I mean to ask is: for a hash table following the standard convention of a prime-number size, is it possible to have some scenario (of inserted keys) where no further insertion of a given element is possible even though there are some empty slots? What kind of hash function would achieve that?
So, most hash functions allow for collisions ("Hash Collisions" is the phrase you should google to understand this better, by the way.) Collisions are handled by having a secondary data structure, like a list, to store all of the values inserted at keys with the same hash.
Because these data structures can generally store arbitrarily many elements, you will always be able to insert into the hash table, but the performance will get worse and worse, approaching the performance of the backing data structure.
If you do not have a backing data structure, then you can be unable to insert as soon as two things get added to the same position. Since a good hash function distributes things evenly and effectively randomly, this would happen pretty quickly (see "The Birthday Problem").
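To put a number on "pretty quickly", here is a short Python sketch of the birthday-problem calculation. It assumes a hash function that spreads keys uniformly and independently over the slots; the function name is made up.

```python
def collision_probability(items, slots):
    """Probability that at least two of `items` uniformly hashed keys share a slot."""
    if items > slots:
        return 1.0  # pigeonhole: more keys than slots forces a collision
    no_collision = 1.0
    for already_placed in range(items):
        no_collision *= (slots - already_placed) / slots
    return 1.0 - no_collision


print(collision_probability(23, 365))     # the classic birthday case, just over one half
print(collision_probability(140, 13903))  # a table with ~14k slots, only 140 keys
```

The number of keys needed for a likely collision grows roughly with the square root of the table size, which is why a table with no collision handling fails long before it is full.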
There are failure-to-insert scenarios for some but not all hash table implementations.
For example, closed hashing aka open addressing implementations use some logic to create a sequence of buckets in which they'll "probe" for values not found at the hashed-to bucket due to collisions. In the real world, sometimes the sequence-creation is pretty basic, for example:
the programmer might have hard-coded N prime numbers, thinking the odds of adding in each of those in turn and still not finding an empty bucket are low (but a malicious user who knows the hash table design may be able to calculate values to make the table fail, or it may simply be so full that the odds are no longer good, or - while emptier - a statistical freak event)
the programmer might have done something like picked a prime number they liked - say 13903 - to add to the last-probed bucket each time until a free one is found, but if the table size happens to be 13903 too it'll keep checking the same bucket.
Still, there are probing approaches such as linear probing that guarantee to try all buckets (unless the implementation goes out of its way to put a limit on retries). It has some other "issues" though, and won't always be the best choice.
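The second failure mode is easy to reproduce. This Python sketch (hypothetical function name) lists the buckets a fixed-step probe would visit: when the step equals the table size every probe lands on the starting bucket, while a step of 1 (linear probing) reaches every bucket.

```python
def probe_sequence(start, step, table_size, attempts):
    """Buckets visited when `step` is added to the previous bucket on each retry."""
    return [(start + attempt * step) % table_size for attempt in range(attempts)]


stuck = probe_sequence(start=5, step=13903, table_size=13903, attempts=10)
print(set(stuck))    # {5}: the same bucket is checked over and over

linear = probe_sequence(start=5, step=1, table_size=11, attempts=11)
print(sorted(linear))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```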
If a hash table is implemented using open addressing instead of separate chaining, then it is a good idea to leave at least 1 slot empty to simplify the algorithm.
In open addressing when we are trying to find an element, we first compute the hash index i, then check the table at indexes {i, i + 1, i + 2, ... N - 1, (wrapping around) 0, 1, 2, ...}, until we either find the element we want or hit an empty slot. You can see that in this algorithm, if no slot is empty but the element can't be found, then the search would loop forever.
However, I should emphasize that enforcing this merely simplifies the search algorithm. Alternatively, the search algorithm can remember the starting index i and halt the search if the entire table has been scanned and it lands back at index i.
Hi there guys,
I'm developing a small program in C that reads strings from a .txt file in a 2 letters and 3 numbers format, like this:
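A minimal Python sketch of the second variant, where the search remembers its starting index and therefore terminates even on a completely full table. The function name and the `len` hash are placeholders for this example.

```python
def find_slot(slots, key, hash_fn):
    """Linear-probing lookup; returns the slot index of key, or None if absent."""
    size = len(slots)
    start = hash_fn(key) % size
    index = start
    while True:
        if slots[index] is None:
            return None       # an empty slot ends the probe sequence
        if slots[index] == key:
            return index
        index = (index + 1) % size
        if index == start:
            return None       # scanned the whole table without finding the key


full_table = ["aa", "bb", "cc"]            # no empty slot at all
print(find_slot(full_table, "bb", len))    # 1
print(find_slot(full_table, "zz", len))    # None, and no infinite loop
```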
AB123
I developed a polynomial hash function that calculates a hash key like this:
hash key(k) = k1 + k2*A^2 + k3*A^3 + ... + kn*A^n
where k1 is the 1st letter of the word, k2 the 2nd (...) and A is a prime number chosen to reduce the number of collisions; in my case it's 11.
OK, so I got the table generated and I can search in the table with no problem, but only if I have the full word. That much I could figure out.
But what if I only want to use the first letter? Is it possible to search in the hash table and get the elements starting with, for example, 'A' without going through every element?
In order to have more functionality you have to introduce more data structures. It all depends on how deep you want to go, which depends on what exactly you need the code to do.
I suspect that you want some kind of filtering for the user. When the user enters "A" they should be given all strings that have "A" at the start, and when they then enter "B" the list should be filtered down to all strings starting with "AB".
If this is the case then you don't need over-complicated structures. Just iterate through the list and give the user the appropriate sublist. Humans are slow, and they won't notice the difference between 3 ms response and 300 ms response.
If your hash function is well designed, every place in the table is capable of storing a string beginning with any prefix, so this approach is doomed from the start.
It sounds like what you really want might be a trie.
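For reference, a bare-bones trie in Python (the question is in C, but the structure carries over directly). The class and method names are made up for this sketch. A prefix lookup walks one node per prefix character and then collects only the words below that node, so it never touches strings that start differently.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def with_prefix(self, prefix):
        """Return all stored words that start with prefix, sorted."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for word in ["AB123", "AC456", "BA789"]:
    trie.insert(word)
print(trie.with_prefix("A"))   # ['AB123', 'AC456']
print(trie.with_prefix("AB"))  # ['AB123']
```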
I understand the basic rationale for a reverse key index: it reduces index contention. Now if I have 3 numbers in the index (12345, 27999, 30632), I can see that if I reverse these numbers, the next number in the sequence won't always hit the same leaf block.
But if the numbers were like 12345, 12346, 12347, then the next numbers 12348, 12349 (incremented by 1) would hit the same leaf block even if the index is reversed:
54321,64321,74321,84321,94321.
So how is the reverse index helping me? It was supposed to help particularly while using sequences
If we're talking about a sequence-generated value, you can't look at 5 values and draw too many conclusions. You need to think about the data that has already been inserted and the data that will be inserted in the future.
Assuming that your sequence started at 12345, the first 5 values would be inserted sequentially. But then the sixth value will be 12350. Reverse that and you get 05321 which would go to the far left of the index. Then you'd generate 12351. Reverse that to get 15321 and that's again toward the left-hand side of the index between the first value you generated (54321) and the most recent value (05321). As the sequence generates new values, they'll go further to the right until everything resets every 10 numbers and you're inserting into the far left-hand side of the index again.
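The pattern is easier to see over a slightly longer run. This Python sketch follows the same simplification used above, reversing decimal digits of a fixed-width number (an assumption for illustration, not a description of how the database stores the key internally), and prints where each new value would sort.

```python
def reverse_key(value, width=5):
    """Reverse the decimal digits of value, padded to a fixed width."""
    return str(value).zfill(width)[::-1]


sequence = list(range(12345, 12352))
reversed_keys = [reverse_key(value) for value in sequence]
for value, key in zip(sequence, reversed_keys):
    print(value, "->", key)

print(sorted(reversed_keys))  # 05321 and 15321 sort ahead of the first five values
```

Over a long run, consecutive sequence values differ in their leading reversed digit, so neighbouring inserts are spread across the whole key range instead of piling up at the right-hand edge.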
Following the pointers in an eBay tech blog and a DataStax developers blog, I modeled some event log data in Cassandra 1.2. As a partition key, I use “yymmddhh|bucket”, where bucket is any number between 0 and the number of nodes in the cluster.
The Data model
cqlsh:Log> CREATE TABLE transactions (
    yymmddhh varchar,
    bucket int,
    rId int,
    created timeuuid,
    data map,
    PRIMARY KEY ((yymmddhh, bucket), created)
);
(rId identifies the resource that fired the event.)
(map holds key-value pairs derived from a JSON document; the keys change, but not much)
I assume that this translates into a composite primary/row key with X buckets per hour.
My column names are then timeuuids. Querying this data model works as expected (I can query time ranges).
The problem is the performance: the time to insert a new row increases continuously.
So I am doing something wrong, but can't pinpoint the problem.
When I use the timeuuid as part of the row key, the performance remains stable at a high level, but this would prevent me from querying it (a query without the row key of course throws an error message about "filtering").
Any help? Thanks!
UPDATE
Switching from the map data type to predefined column names alleviates the problem. Insert times now seem to stay under 0.005s per insert.
The core question remains:
How is my usage of the "map" datatype inefficient? And what would be an efficient way to handle thousands of inserts with only slight variation in the keys?
The keys I put into the map mostly remain the same. I understood the DataStax documentation (can't post a link due to reputation limitations, sorry, but it's easy to find) to say that each key creates an additional column. Or does it create one new column per "map"? That would be hard for me to believe.
I suggest you model your rows a little differently. Collections aren't a good fit in cases where you might end up with too many elements in them. The reason is a limitation in the Cassandra binary protocol, which uses two bytes to represent the number of elements in a collection. This means that if your collection has more than 2^16 elements in it, the size field will overflow, and even though the server sends all of the elements back to the client, the client only sees the first N % 2^16 elements (so if you have 2^16 + 3 elements it will look to the client as if there are only 3 elements).
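The wrap-around is plain modular arithmetic. The Python sketch below only illustrates what a two-byte unsigned count does to a large size; it is not the actual protocol or driver code, and the function name is invented.

```python
import struct


def size_seen_by_client(actual_size):
    """Round-trip an element count through a two-byte unsigned big-endian field."""
    wire_bytes = struct.pack(">H", actual_size % 2**16)  # only the low 16 bits fit
    (decoded,) = struct.unpack(">H", wire_bytes)
    return decoded


print(size_seen_by_client(100))        # 100
print(size_seen_by_client(2**16 + 3))  # 3
```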
If there is no risk of getting that many elements into your collections, you can ignore this advice. I would not think that using collections gives you worse performance, I'm not really sure how that would happen.
CQL3 collections are basically just a hack on top of the storage model (and I don't mean hack in any negative sense), you can make a MAP-like row that is not constrained by the above limitation yourself:
CREATE TABLE transactions (
yymmddhh VARCHAR,
bucket INT,
created TIMEUUID,
rId INT,
key VARCHAR,
value VARCHAR,
PRIMARY KEY ((yymmddhh, bucket), created, rId, key)
)
(Notice that I moved rId and the map key into the primary key, I don't know what rId is, but I assume that this would be correct)
This has two drawbacks compared to using a MAP: it requires you to reassemble the map when you query the data (you get back one row per map entry), and it uses a little more space since C* will insert a few extra columns. The upside is that there is no problem with collections getting too big.
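Reassembling the map on the client is only a few lines. Here is a Python sketch that assumes the query results have already been fetched as (created, rId, key, value) tuples; how you fetch them depends on your driver, so that part is left out, and the sample values are invented.

```python
from collections import defaultdict


def reassemble_maps(rows):
    """Group one-row-per-entry results back into one dict per (created, rId) event."""
    events = defaultdict(dict)
    for created, r_id, key, value in rows:
        events[(created, r_id)][key] = value
    return dict(events)


rows = [
    ("uuid-1", 7, "status", "ok"),
    ("uuid-1", 7, "latency", "12"),
    ("uuid-2", 9, "status", "error"),
]
print(reassemble_maps(rows))
# {('uuid-1', 7): {'status': 'ok', 'latency': '12'}, ('uuid-2', 9): {'status': 'error'}}
```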
In the end it depends a lot on how you want to query your data. Don't optimize for insertions, optimize for reads. For example: if you don't need to read back the whole map every time, but usually just read one or two keys from it, put the key in the partition/row key instead and have a separate partition/row per key (this assumes that the set of keys will be fixed so you know what to query for, so as I said: it depends a lot on how you want to query your data).
You also mentioned in a comment that the performance improved when you increased the number of buckets from three (0-2) to 300 (0-299). The reason for this is that you spread the load much more evenly throughout the cluster. When you have a partition/row key that is based on time, like your yymmddhh, there will always be a hot partition where all writes go (it moves throughout the day, but at any given moment it will hit only one node). You correctly added a smoothing factor with the bucket column/cell, but with only three values the likelihood of at least two ending up on the same physical node is too high. With three hundred you will have a much better spread.
Use yymmddhh as the row key and bucket+timeUUID as the column name, where each bucket holds 20 (or some other fixed number of) records. The buckets can be managed using a counter column family.
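You can get a feel for the spread with a toy simulation in Python. Note the assumptions: MD5 modulo the node count is only a stand-in for Cassandra's partitioner and token ranges, and the hour value and six-node cluster are made up.

```python
import hashlib
from collections import Counter


def node_for(partition_key, node_count):
    """Stand-in for the partitioner: hash the partition key and pick a node."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def partitions_per_node(hour, bucket_count, node_count):
    """Count how many of one hour's partitions land on each node."""
    return Counter(
        node_for(f"{hour}|{bucket}", node_count) for bucket in range(bucket_count)
    )


print(partitions_per_node("13051509", 3, 6))    # at most three nodes take all the writes
print(partitions_per_node("13051509", 300, 6))  # partitions scattered over the cluster
```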