hash table index design - algorithm

I want to use a hash table to store words.
For example, I have two words, aba and aab. Because they are made up of the same letters, just in a different order, I want to store them under the same index and attach a linked list at that index. That makes it easy for me to search in a certain way. The words consist only of the 26 letters. How do I design a proper index for the hash table? How should I organize the table?

So the question you want to answer with your hash table is: What words can be built with the letters I have?
I assume you are reading some dictionary and want to put all the words in the hash table. Then you could use as the key an int array that counts how many times each letter occurs (e.g. 'a' would be index 0 and 'z' index 25), and for the value you would have to use a list, so that you can add more than one word to that entry.
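As a rough sketch of the letter-count key, assuming words contain only the lowercase letters 'a' to 'z': the class and method names below are made up for illustration. One Java-specific caveat is that a raw int[] uses identity-based equals/hashCode, so the counts are converted to a value-comparable form (here simply their string representation) before being used as a map key.

```java
import java.util.Arrays;

final class LetterCountKey {
    // Counts how often each letter occurs; index 0 is 'a', index 25 is 'z'.
    // Assumption: the word contains only lowercase 'a'..'z'.
    static int[] counts(String word) {
        int[] c = new int[26];
        for (char ch : word.toCharArray()) {
            c[ch - 'a']++;
        }
        return c;
    }

    // int[] compares by identity in Java, so turn the counts into a String
    // that compares by value and can serve as a HashMap key.
    static String keyFor(String word) {
        return Arrays.toString(counts(word));
    }

    public static void main(String[] args) {
        System.out.println(keyFor("aba").equals(keyFor("aab"))); // prints true
    }
}
```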
But the simplest solution is probably just to use the sorted word as the key (e.g. 'aba' gets key 'aab', and 'aab' obviously does too). Because words are not very long, the sort isn't expensive (avoid creating new strings all the time by working with the character array).
So in Java you could get the key like this:
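// requires: import java.util.Arrays;
char[] key = word.toCharArray();
// sorts the letters in place, so "aba" becomes {'a', 'a', 'b'}
Arrays.sort(key);
// and if you want a string
String myKey = new String(key);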
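To show how the sorted key could be used to organize the table, here is a minimal sketch in which each key maps to a list of words, so all anagrams land in the same bucket. The class name AnagramIndex and its methods are hypothetical and not part of the answer above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AnagramIndex {
    // The sorted letters of a word serve as its hash key.
    static String keyFor(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Groups words so that all anagrams share one bucket (a list per key).
    static Map<String, List<String>> build(List<String> words) {
        Map<String, List<String>> table = new HashMap<>();
        for (String w : words) {
            table.computeIfAbsent(keyFor(w), k -> new ArrayList<>()).add(w);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = build(Arrays.asList("aba", "aab", "abc"));
        // Look up every stored word made of the letters b, a, a.
        System.out.println(table.get(keyFor("baa"))); // prints [aba, aab]
    }
}
```

A lookup sorts the query letters the same way and fetches the bucket, so it costs one sort of a short string plus one hash probe.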

Related

How do relational databases perform ORDER BY on multiple columns with secondary indices?

I'm trying to understand the sorting algorithm behind SQL ORDER BY clauses in the case that the properties are indexed.
Secondary indices are usually implemented as B+ Trees with a combined key, consisting of the indexed value and the associated primary key. For example, an index on first name may look like this:
Key        Value
---------  -----
John.id4   null
John.id5   null
Jane.id16  null
...        ...
The task that sorting needs to perform is: given a set of IDs and a list of sort commands (consisting of column and ASC/DESC), sort the IDs.
If we only want to sort by a single column (e.g. ORDER BY FirstName), the algorithm is easy:
Iterate over the secondary index.
If the ID part of the Key occurs in the input set, remove it from the set and add it to the (sorted) output list
Stop if the input set becomes empty or the index has reached its end, whichever occurs first
Return the output list.
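The single-column procedure above can be sketched in Java roughly as follows. The Entry class and the sortByIndex method are hypothetical, and a plain list that is already in key order stands in for the B+ tree scan. It only illustrates the steps as described, not how any real database engine implements them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class IndexOrderSketch {
    // One secondary-index entry: the indexed value plus the primary key.
    static final class Entry {
        final String value;
        final int id;
        Entry(String value, int id) { this.value = value; this.id = id; }
    }

    // Walks the index in key order and emits the wanted IDs as they appear.
    static List<Integer> sortByIndex(List<Entry> indexInKeyOrder, Set<Integer> ids) {
        Set<Integer> remaining = new HashSet<>(ids);
        List<Integer> output = new ArrayList<>();
        for (Entry e : indexInKeyOrder) {
            if (remaining.isEmpty()) {
                break; // every requested ID has been placed
            }
            if (remaining.remove(e.id)) {
                output.add(e.id);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Entry> index = Arrays.asList(
                new Entry("Jane", 16), new Entry("John", 4), new Entry("John", 5));
        Set<Integer> wanted = new HashSet<>(Arrays.asList(5, 16));
        System.out.println(sortByIndex(index, wanted)); // prints [16, 5]
    }
}
```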
But how does the same thing work if we have multiple sort columns? For example, the clause ORDER BY FirstName ASC, LastName ASC? The main issue is of course that we cannot tie-break between two IDs simply by looking them up in the second index, because it is sorted by index value, not by primary key. We would have to minimize the number of scans per index as much as possible.
How do big databases, such as PostgreSQL or MySQL, solve this issue?

Get the full word by few letters, API?

I have a problem: for example, we have a few letters: b, o, s. They are letters from some word, and they appear in the same order as in the word (in this case the word is books). But, of course, it may be another word.
So I need to get a list of the possible words (for example, length = 10). How can I do this? I feel that the problem is close to crossword solving, so maybe there are some services with an API?
Get yourself a word list https://www.google.com/?gws_rd=ssl#q=english+word+list
Put the words in a database table CREATE TABLE Words(Word VARCHAR(64) PRIMARY KEY) ... INSERT INTO Words(Word) VALUES(UPPER(#word)) ...
Query the table with a LIKE query SELECT Word FROM Words WHERE Word LIKE 'B%O%S'
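For a word list small enough to keep in memory, the same match can be sketched without a database. The following is an illustration under that assumption, with hypothetical class and method names: it turns the letters into a regular expression equivalent to LIKE 'B%O%S', with each % becoming ".*". Like that LIKE pattern, it requires the word to start with the first letter and end with the last one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class LetterPatternSearch {
    // Builds the regex b.*o.*s from the letters "bos" (case-insensitive).
    static Pattern toPattern(String letters) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < letters.length(); i++) {
            if (i > 0) {
                regex.append(".*"); // plays the role of '%' in LIKE
            }
            regex.append(Pattern.quote(String.valueOf(letters.charAt(i))));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Returns at most 'limit' words that match the whole pattern.
    static List<String> search(List<String> words, String letters, int limit) {
        Pattern p = toPattern(letters);
        List<String> hits = new ArrayList<>();
        for (String w : words) {
            if (hits.size() >= limit) {
                break;
            }
            if (p.matcher(w).matches()) {
                hits.add(w);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("books", "boss", "bats", "robots");
        System.out.println(search(words, "bos", 10)); // prints [books, boss]
    }
}
```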

How does RethinkDB generate auto ids?

I'm writing a script which is supposed to merge some data from an SQL-based DB. Each row has a long integer as its (incremental) primary key. I was thinking about hashing these ids so that they'll somehow 'look' like the other ids already in my RethinkDB table. What I'm trying to achieve here is to avoid dups in case of an attempt to merge the same data again, but keeping the original integers as ids alongside the generated ids of the data saved directly to RethinkDB's table feels weird.
Can I do that?
How does RethinkDB generate auto ids anyways?
And am I approaching this correctly..?
RethinkDB uses a string encoding of 128-bit UUIDs (basically hashed integers).
The string format looks like this: "HHHHHHHH-HHHH-HHHH-HHHH-HHHHHHHHHHHH", where every 'H' is a hexadecimal digit of the 128-bit integer. The characters 0-9 and a-f (lower case) are used.
If you want to generate such UUIDs from an existing integer, I recommend hashing the integer first. This will give you an even distribution over the whole key space (this makes sharding easier and avoids hotspots).
As a second step, you have to format the hash value as a string in the format shown above. If you don't have enough digits, it's fine to leave some of the last 'H' digits as constant 0.
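One possible way to do both steps in Java is sketched below. It is not RethinkDB's own generator: it uses java.util.UUID.nameUUIDFromBytes, which hashes the input bytes with MD5 and produces a name-based (version 3) UUID, so a few bits are fixed by the UUID version and variant fields. The class and method names are hypothetical. Because the result depends only on the input, merging the same row twice yields the same id.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

final class DeterministicId {
    // Hashes the 8 bytes of the SQL primary key and formats the result as
    // a lowercase 8-4-4-4-12 hexadecimal string.
    static String fromLong(long sqlId) {
        byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(sqlId).array();
        return UUID.nameUUIDFromBytes(bytes).toString();
    }

    public static void main(String[] args) {
        // The same input always maps to the same id.
        System.out.println(fromLong(42L).equals(fromLong(42L))); // prints true
    }
}
```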
If you really want to go into the details of UUID generation, here are two links for further reading:
RFC 4122 "A Universally Unique IDentifier (UUID) URN Namespace" https://www.rfc-editor.org/rfc/rfc4122
RethinkDB's implementation of UUID generation and formatting https://github.com/rethinkdb/rethinkdb/blob/next/src/containers/uuid.cc

LevelDB: Iterate keys by insertion order

What's a good strategy for generating auto-incrementing keys in LevelDB? My goal is to be able to iterate over the keys in the order that they were inserted.
Two methods:
Use the default comparator, but convert the index key with a function, e.g. '1' to '000000001' and '20' to '000000020', so that LevelDB sorts them in numeric order;
Define a custom comparator that converts each key from a string to an integer and then compares the integers.
With either of these two methods, you need to store a key-value pair in LevelDB: current_id ----> integer. Alternatively, you can store the current id in a separate file using mmap.
Then, in your own Add() function, after you read the current id from the key current_id, you can insert a new key-value pair, id ----> value, and then increment current_id by one.
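The first method (zero-padded keys under the default ordering) can be illustrated with a small sketch. It does not use any LevelDB API: a java.util.TreeMap with String keys stands in for the store, since it also iterates keys in lexicographic order, and a field stands in for the stored current_id. All names are hypothetical.

```java
import java.util.Locale;
import java.util.TreeMap;

final class PaddedKeyCounter {
    // Stand-in for the store: iterates String keys in lexicographic order.
    private final TreeMap<String, String> store = new TreeMap<>();
    // Stand-in for the value kept under the "current_id" key.
    private long currentId = 0;

    // Converts 1 to "000000001" and 20 to "000000020" (nine digits).
    static String pad(long id) {
        return String.format(Locale.ROOT, "%09d", id);
    }

    // Inserts the value under the next padded id, then increments the id.
    String add(String value) {
        String key = pad(currentId);
        store.put(key, value);
        currentId++;
        return key;
    }

    Iterable<String> valuesInInsertionOrder() {
        return store.values();
    }

    public static void main(String[] args) {
        // Without padding, "10" would sort before "9".
        System.out.println("10".compareTo("9") < 0);        // prints true
        System.out.println(pad(9).compareTo(pad(10)) < 0);  // prints true
    }
}
```

Nine digits cap the counter at 999999999; a wider pad would be needed beyond that.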
Since a LevelDB instance can only be accessed from one application at a time, you might as well use a 64-bit long and increment it in the application. When opening the DB (and before allowing any writes), to find the last inserted key you can use the SeekToLast() method of the Iterator.
As I just pointed out in a question on integer keys, if you want to use binary integers you need to create a custom Comparator for the database, otherwise you don't get them in ascending binary order. It's not hard but you may have overlooked the need.
I'm not quite sure what you're asking. If the only data you are adding is keys that are supposed to record an entry, as in a log, then yes, just use an integer key.
However, if you are inserting keys you are going to search for some other reason PLUS you want to later iterate them in insertion order, it gets a bit more complex.
Basically you want to insert two keys for each key value, using a prefix to determine whether keys are "value keys" or "ordering keys". e.g., say you have Frank, John, Sally and Amy as keys and use prefix ~N for Name keys and ~I for Iterator keys.
The database looks like the following, note that the "Iterator keys" don't have a value associated with them as we can just get the names out of the key. I've shown it as if you used a string of two digits for the number, rather than using an integer value and needing a special Comparator.
~I00Frank
~I01John
~I02Sally
~I03Amy
~NAmy => Amy's details
~NFrank => Frank's details
~NJohn => John's details
~NSally => Sally's details
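The two-prefix layout above can be tried out with a sketch like the following. Again a java.util.TreeMap stands in for LevelDB's sorted key space (no LevelDB API is used), the class and method names are hypothetical, and the two-digit counter from the example limits it to 100 entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

final class OrderedNameStore {
    private final TreeMap<String, String> store = new TreeMap<>();
    private int next = 0;

    // Writes a "~N" value key and a "~I" ordering key for each name.
    void put(String name, String details) {
        store.put("~N" + name, details);
        String seq = String.format(Locale.ROOT, "%02d", next++);
        store.put("~I" + seq + name, ""); // ordering keys carry no value
    }

    String get(String name) {
        return store.get("~N" + name);
    }

    // Scans only the ordering keys, which all lie in the range ["~I", "~J").
    List<String> namesInInsertionOrder() {
        List<String> names = new ArrayList<>();
        for (String key : store.subMap("~I", "~J").keySet()) {
            names.add(key.substring(4)); // strip "~I" and the two digits
        }
        return names;
    }

    public static void main(String[] args) {
        OrderedNameStore db = new OrderedNameStore();
        for (String n : new String[] {"Frank", "John", "Sally", "Amy"}) {
            db.put(n, n + "'s details");
        }
        System.out.println(db.namesInInsertionOrder()); // prints [Frank, John, Sally, Amy]
    }
}
```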

Fastest way to find records that end with key

I'm looking for an optimal way to search through millions of records that contain a serial number, saved in a varchar column, which ends with a specified string key.
I was using EndsWith; however, performance is rather poor if several queries are sent.
Is there a better way to do it?
EDIT:
Since the search key is of variable length, I can't create a column that holds a cut-off value of the serial number. However, I've done some tests using Substring and Equals instead of EndsWith, and I've reduced the execution time to 40% of that of EndsWith.
I'm still looking for better solution though :)
Unfortunately, searching for strings ending with a particular pattern is difficult on most databases+, because searching for string suffixes cannot use an index. This results in full table scans, which may be slow on tables with millions of rows.
If your database supports reverse indexes, add one for your string key column; otherwise, you can improve performance by simulating reverse indexes:
Add a column for storing your string key in reverse
    If your RDBMS supports computed columns, add one for the reversed key
    Otherwise, define a trigger that populates the reversed column from the key column
Create an index on the reversed column
Use the reversed column for your searches by passing in the reversed suffix that you are looking for.
For example, if you have data like this
key
-----------
01-02-3-xyz
07-12-8-abc
then the augmented table would have
key rev_key
----------- -----------
01-02-3-xyz zyx-3-20-10
07-12-8-abc cba-8-21-70
and your search for ENDS_WITH(key, '3-xyz') would ask for STARTS_WITH(rev_key, 'zyx-3'). Since string indexes speed up lookups by prefix, the "starts with" lookup would go much faster.
+ One notable exception is Oracle, which provides reverse key indexes specifically for situations like this.
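The idea of the simulated reverse index can be sketched in memory as follows. A java.util.TreeMap keyed by the reversed string stands in for the index on rev_key, and the suffix search becomes a prefix range scan. This is only an illustration with hypothetical names, not database code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReversedKeyIndex {
    // Maps reversed key -> original key, sorted by the reversed key.
    private final TreeMap<String, String> byReversed = new TreeMap<>();

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    void add(String key) {
        byReversed.put(reverse(key), key);
    }

    // ENDS_WITH(key, suffix) becomes STARTS_WITH(rev_key, reverse(suffix)).
    List<String> endsWith(String suffix) {
        String prefix = reverse(suffix);
        List<String> hits = new ArrayList<>();
        // Start at the first entry >= prefix and stop once the prefix no longer matches.
        for (Map.Entry<String, String> e : byReversed.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            hits.add(e.getValue());
        }
        return hits;
    }

    public static void main(String[] args) {
        ReversedKeyIndex index = new ReversedKeyIndex();
        index.add("01-02-3-xyz");
        index.add("07-12-8-abc");
        System.out.println(index.endsWith("3-xyz")); // prints [01-02-3-xyz]
    }
}
```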
