Why doesn't Datomic have an EATV index?

I would think that a common operation on any DBMS, even Datomic, would be to retrieve the most recent values of attribute(s) of a given entity. E.g. show me Joe's (most recent) address.
Given the 4 available indices all have T at the end, it seems like this common operation would not be very efficient. For example, using EAVT, you would have to search through all of the values for a given entity-attribute pair, in order to find the one with the most recent T.
Is there something missing from or wrong with this analysis? If not, doesn't that imply that there should be an EATV index?

Datomic's indexes are covering indexes - see the docs on this topic. You're not navigating multiple trees of pointers to flesh out an entity; you're actually retrieving the (sorted) datoms about an entity by navigating the index tree for EAVT (by E) and retrieving those datoms. In fact, entities themselves are just inferred from the datoms about them; they're not otherwise implemented.
To navigate EAVT, you walk the index tree to the datoms about an E and retrieve the leaf segment, which contains sorted E,A,V,Tx datoms about the entity for the current database (as of its basis-T). Remember also that Datomic supports cardinality-many attributes.
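To make that concrete, here's a minimal sketch using the raw index API. This is a sketch only: it assumes the peer library, an already-open connection conn, and that :name is a :db/unique attribute (so it can be used as a lookup ref), none of which are stated in the question.

(require '[datomic.api :as d])

(let [db  (d/db conn)                  ; current database value
      eid (d/entid db [:name "Joe"])]  ; lookup ref; assumes :name is unique
  ;; d/datoms seeks directly into the covering EAVT index; the datoms
  ;; for this entity/attribute pair sit contiguously in one leaf
  ;; segment, already current as of the db's basis-T, so there is no
  ;; scan over historical values.
  (mapv :v (d/datoms db :eavt eid :address)))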

It will be rare to have few entities with few attributes and massive amounts of churn on the values. That would need to be the case for an EATV index to help.
It's the EA portion of the index that really matters for lookup speed. Taking the most recent value of all the attributes for a given entity is a rapid filter over a contiguous set of datoms in an EAVT index (which, like all indexes in Datomic, is a covering index, meaning the ordered datoms are actually present within the index structure).

For finding the most recent value of an attribute you don't need to search the history db.
(d/q '[:find ?address
       :where
       [?e :name "Joe"]
       [?e :address ?address]]
     db)
will give you the most recent address of Joe (in the db version provided to the query) and efficiently uses EAVT.
There's some more background on the topic on the Datomic google group.

For retrieving the most current value, Datomic doesn't have to iterate over all possible values: Datomic keeps the current values in a separate B-tree (called the current part), so this should be really fast. For further explanation, read this AWESOME blog:
http://tonsky.me/blog/unofficial-guide-to-datomic-internals/
However, why EAVT is preferred to EATV is unclear to me.
Also, it is unclear how Datomic performs as-of queries. When as-of-ing, Datomic has to join the history part and the current part (terminology from the article mentioned above), which leads exactly to the problem you originally posed.

Related

Any reference to definition or use of the data structuring technique "hash linking"?

I would like more information about a data structure - or perhaps it is better described as a data structuring technique - that was called hash linking when I read about it in an IBM Research Report a long time ago - in the 70s or early 80s. (The RR may have been from the 60s.)
The idea was to be able to (more) compactly store a table (array, vector) of values when most values fit in a (relatively) small compact range but some values (may) have had unusually large (or small) values out of that range. Instead of making each element of the table wider to hold the entire range you would store, in the table, only those values that fit in the small compact range and put all other entries that didn't fit into a hash table.
One use case I remember being mentioned was for bank accounts - you might determine that 98% of the accounts in your bank had balances under $10,000.00 so they would nicely fit in a 6-digit (decimal) field. To handle the very few accounts $10,000.00 or over you would hash-link them.
There were two ways to arrange it: Both involved a table (array, vector, whatever) where each entry would have enough space to fit the 95-99% case of your data values, and a hash table where you would put the ones that didn't fit, as a key-value pair (key was index into table, value was the item value) where the value field could really fit the entire range of the values.
The first way: you would pick a sentinel value, depending on your data type. It might be 0, or might be the largest representable value. If the value you were trying to store didn't fit the table, you'd stick the sentinel in there and put the (index, actual value) pair into the hash table. To retrieve, you'd get the value by its index, check if it was the sentinel, and if it was, look it up in the hash table.
The second way: you would have no reasonable sentinel value. No problem. You just store the exceptional values in your hash table, and on retrieval you always look in the hash table first. If the index you're trying to fetch isn't there, you're good: just get it out of the table itself.
Benefit was said to be saving a lot of storage while only increasing access time by a small constant factor in either case (due to the properties of a hash table).
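As a concrete illustration, here is a minimal sketch of the sentinel arrangement in Clojure. The names, numbers, and sentinel choice are mine, not from the report, and a real implementation would get its storage win from narrow fixed-width slots in the main table, which Clojure's vectors don't model; this only shows the store/fetch logic.

(def sentinel -1) ; assumes -1 can never occur as a real stored value

(defn hl-store [hl i v small-max]
  (if (<= 0 v small-max)
    (assoc-in hl [:table i] v)           ; fits the narrow field
    (-> hl
        (assoc-in [:table i] sentinel)   ; flag the slot...
        (assoc-in [:overflow i] v))))    ; ...real value goes to the hash table

(defn hl-fetch [{:keys [table overflow]} i]
  (let [v (nth table i)]
    (if (= v sentinel)
      (get overflow i)  ; exceptional value: one extra hash lookup
      v)))

;; 6-digit field: balances under $10,000.00 (in cents) fit; others overflow
(def accounts {:table (vec (repeat 8 0)) :overflow {}})
(-> accounts
    (hl-store 3 999900 999999)   ; $9,999.00 fits in the small field
    (hl-store 5 1234567 999999)  ; $12,345.67 overflows into the hash map
    (hl-fetch 5))                ; => 1234567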
(A related technique is to work it the other way around if most of your values were a single value and only a few were not that value: Keep a fast searchable table of index-value pairs of the ones that were not the special value and a set of the indexes of the ones that were the very-much-most-common-value. Advantage would be that the set would use less storage: it wouldn't actually have to store the value, only the indexes. But I don't remember if that was described in this report or I read about that elsewhere.)
The answer I'm looking for is a pointer to the original IBM report (though my search on the IBM research site turned up nothing), or to any other information describing this technique or using this technique to do anything. Or maybe it is a known technique under a different name, that would be good to know!
Reason I'm asking: I'm using the technique now and I'd like to credit it properly.
N.B.: This is not a question about:
anything related to hash tables as hash tables, especially not linking entries or buckets in hash tables via pointer chains (which is why I specifically did not add the tag hashtable),
an "anchor hash link" - using a # in a URL to point to an anchor tag - which is what "hash link" gets you when you search for it on the intertubes,
hash consing which is a different way to save space, for much different use cases.
Full disclosure: There's a chance it wasn't in fact an IBM report where I read it. During the 70s and 80s I was reading a lot of TRs from IBM and other corporate labs, and MIT, CMU, Stanford and other university departments. It was definitely in a TR (not a journal or ACM SIG publication) and I'm nearly 100% sure it was IBM (I've got this image in my head ...) but maybe, just maybe, it wasn't ...

What does Elasticsearch 5 do under the hood when sorting?

I read the following words in the Elasticsearch docs.
https://www.elastic.co/guide/en/elasticsearch/reference/5.4/search-request-sort.html#_memory_considerations
When sorting, the relevant sorted field values are loaded into memory. This means that per shard, there should be enough memory to contain them.
This is different from my understanding of sorting. I thought that some datatypes, keyword for example, should already be sorted, since Elasticsearch creates an index on them. These already-sorted fields should not need to be loaded into memory to be sorted again.
So do I understand this right?
An index in relational databases means a B*-tree, and that is indeed sorted.
An index in Elasticsearch is where you store your data; previously we compared that to a table in the relational world, but for various reasons this is not really true, so let's not use that as a direct comparison. Except for the index-time sorting Val mentioned above, an index is not stored as a sorted data structure based on a specific field. However, some fields can be used efficiently for sorting (like numeric data types or not-analyzed text), and this is where the memory consideration from above comes into play.

Datastructure to store bidirectional relationship

I have googled and couldn't find any DS to store and read bidirectional data in O(1) time. E.g. books and authors: given the name of a book, the authors have to be found; given the name of an author, the books have to be found.
How are these relations, like join tables, stored in a DB?
Thanks in advance.
The idea is a combination of the following:
A hash map of the first element to the second element (or a list of them)
A hash map of the second element to the first element (or a list of them)
Hash maps give expected O(1) lookup.
I don't believe databases typically use hash maps though; more typically B-trees, as far as I know (giving O(log n) performance).
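A minimal sketch of the two-hash-map idea in Clojure (sample data and names are mine); note that both maps have to be kept in sync on every update:

(defn add-relation [rel book author]
  (-> rel
      (update-in [:by-book book] (fnil conj #{}) author)     ; book -> authors
      (update-in [:by-author author] (fnil conj #{}) book))) ; author -> books

(def rel (-> {:by-book {} :by-author {}}
             (add-relation "SICP" "Abelson")
             (add-relation "SICP" "Sussman")))

(get-in rel [:by-book "SICP"])      ; => #{"Abelson" "Sussman"}, expected O(1)
(get-in rel [:by-author "Sussman"]) ; => #{"SICP"}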
Expanding on Dukeling's answer, I believe Google has an implementation called HashBiMap: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/HashBiMap.html

Suitable data structure for finding a person's phone number, given their name?

Suppose you want to write a program that implements a simple phone book. Given a particular name, you want to be able to retrieve that person's phone number as quickly as possible. What data structure would you use to store the phone book, and why?
The text below answers your question.
In computer science, a hash table or hash map is a data structure that
uses a hash function to map identifying values, known as keys (e.g., a
person's name), to their associated values (e.g., their telephone
number). Thus, a hash table implements an associative array. The hash
function is used to transform the key into the index (the hash) of an
array element (the slot or bucket) where the corresponding value is to
be sought.
The text is from wiki:hashtable.
There are some further discussions, like collisions, hash functions... check the wiki page for details.
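In Clojure terms (the sample data is mine), the hash table answer is just a map from name to number:

(def phone-book {"Alice" "555-0100"
                 "Bob"   "555-0147"})

(get phone-book "Alice") ; => "555-0100", expected O(1) lookup
(get phone-book "Carol") ; => nil, not in the book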
I respect & love hashtables :) but even a balanced binary tree would be fine for your phone book application, giving you worst-case logarithmic complexity and sparing you from needing good hash functions, collision handling, etc., which makes it more suitable for huge amounts of data.
When I talk about huge data, what I mean relates to storage. Every time you fill all of the buckets in a hash table you will need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. Balanced trees won't run into these problems. The domain needs to be considered too while designing data structures; for example, on small devices storage matters a lot.
I was wondering why tries didn't come up in one of the answers.
Tries are suitable for phone-book kind of data.
Also, they save space compared to a hash table at (almost) the same retrieval cost (assuming a constant-size alphabet and constant-length names).
Tries also facilitate the 'prefix matches' sometimes required while searching.
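A minimal trie sketch in Clojure using nested maps (the helper names and sample data are mine; a production trie would use a more compact node representation):

(defn trie-insert [trie name number]
  (assoc-in trie (conj (vec name) ::number) number))

(defn trie-lookup [trie name]
  (get-in trie (conj (vec name) ::number)))

(def trie (-> {}
              (trie-insert "Ann" "555-0101")
              (trie-insert "Anna" "555-0102")))

(trie-lookup trie "Anna")  ; => "555-0102"
;; prefix matches fall out naturally: (get-in trie (vec "Ann"))
;; returns the entire subtree of names starting with "Ann".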
A dictionary is both dynamic and fast.
You want a dictionary, where you use the name as the key, and the number as the data stored. Check this out: http://en.wikipedia.org/wiki/Dictionary_%28data_structure%29
Why not use a singly linked list? Each node will have the name, number and link information.
One drawback is that your search might take some time since you'll have to traverse the entire list from link to link. You might order the list at the time of node insertion itself!
PS: To make the search a tad bit faster, maintain a link to the middle of the list. Search can continue to the left or right of the list based on the value of the "name" field at this node. Note that this requires a doubly linked list.

Best data structure for a given set of operations - Add, Retrieve Min/Max and Retrieve a specific object

I am looking for the optimal (time and space) data structure for supporting the following operations:
Add Persons (name, age) to a global data store of persons
Fetch Person with minimum and maximum age
Search for Person's age given the name
Here's what I could think of:
Keep an array of Persons, and keep adding to end of array when a new Person is to be added
Keep a hash of Person name vs. age, to assist in fetching person's age with given name
Maintain two objects minPerson and maxPerson for Person with min and max age. Update this if needed, when a new Person is added.
Now, although I keep a hash for better performance of (3), I think it may not be the best way if there are many collisions in the hash. Also, addition of a Person would mean an overhead of adding to the hash.
Is there anything that can be further optimized here?
Note: I am looking for the best (balanced) approach to support all these operations in minimum time and space.
You can get rid of the array as it doesn't provide anything that the other two structures can't do.
Otherwise, a hashtable + min/max is likely to perform well for your use case. In fact, this is precisely what I would use.
As to getting rid of the hashtable because a poor hash function might lead to collisions: well, don't use a poor hash function. I bet that the default hash function for strings that's provided by your programming language of choice is going to do pretty well out of the box.
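A minimal sketch of that approach in Clojure (the names and sample data are mine). Note that deletes would need more machinery to restore min/max, but the question only asks for adds and lookups:

(defn add-person [{:keys [by-name min-p max-p]} name age]
  (let [p {:name name :age age}]
    {:by-name (assoc by-name name age)  ; name -> age, expected O(1) lookup
     :min-p   (if (or (nil? min-p) (< age (:age min-p))) p min-p)
     :max-p   (if (or (nil? max-p) (> age (:age max-p))) p max-p)}))

(def store (-> {:by-name {} :min-p nil :max-p nil}
               (add-person "Joe" 30)
               (add-person "Ann" 25)
               (add-person "Bob" 41)))

(get (:by-name store) "Ann") ; => 25
(:min-p store)               ; => {:name "Ann", :age 25}
(:max-p store)               ; => {:name "Bob", :age 41}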
It looks like you need a data structure that supports fast inserts and also fast queries on 2 different keys (name and age).
I would suggest keeping two data structures, one a sorted data structure (e.g. a balanced binary search tree) where the key is the age and the value is a pointer to the Person object, the other a hashtable where the key is the name and the value is a pointer to the Person object. Notice we don't keep two copies of the same object.
A balanced binary search tree would provide O(log(n)) inserts and max/min queries, while the hashtable would give us O(1) (amortized) inserts and lookups.
When we add a new Person, we just add a pointer to it to both data structures. For a min/max age query, we can retrieve the Object by querying the BST. For a name query we can just query the hashtable.
Your question does not ask for updates/deletes, but those are also doable by suitably updating both data structures.
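A minimal sketch of the two-structure idea in Clojure (the names and sample data are mine). Clojure's sorted-set stands in for the balanced BST: [age name] pairs order by age first, and first/rseq give the min and max:

(defn add-person [store name age]
  (-> store
      (update :by-age conj [age name])   ; sorted by age: O(log n) insert
      (update :by-name assoc name age))) ; hash map: expected O(1) insert

(def store (-> {:by-age (sorted-set) :by-name {}}
               (add-person "Joe" 30)
               (add-person "Ann" 25)
               (add-person "Bob" 41)))

(first (:by-age store))        ; => [25 "Ann"], the minimum age
(first (rseq (:by-age store))) ; => [41 "Bob"], the maximum age
(get (:by-name store) "Joe")   ; => 30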
It sounds like you're expecting the name to be the unique identifier; otherwise your operation 3 is ambiguous. (What is the correct return result if you have two entries for John Smith?)
Assuming that the uniqueness of a name is guaranteed, I would go with a plain hashtable keyed by names. Operation 1 and 3 are trivial to execute. Operation 2 could be done in O(N) time if you want to search through the data structure manually, or you can do like you suggest and keep track of the min/max and update it as you add/delete entries in the hash table.
