Confusion about hash tables

I am currently studying for some interviews, and I've heard that at some of these interviews people are sometimes asked to build a data structure from scratch, including a hash table. However, I am having some trouble really understanding hash tables from a programming perspective.
I've been building these data structures from scratch in C++, and I know that using templates I can create linked lists, dynamic arrays, binary search trees, etc., that can store basically any type of object (as long as that object is the only type stored in that instance of the structure). So I would assume I could create a template or "generic" hash table that, depending on the instance, could store a particular type of object. But two things confuse me:
I know that, through a hash function, different keys are mapped to different indices in the array that makes up the hash table. But let's say you are using the hash table you created to store objects of type Book, and then you create another hash table to store objects of type People. Obviously, different types of objects will have different member attributes, and one of these attributes would have to be the key. Would this mean that every object you would ever want to store in the hash table would have to have at least one attribute with the same name? Your hash function has to have some key value to hash, so it would have to know which attribute of the object to use as the key. So, for example, would every object you want to store in this hash table have to have an attribute called "key" that the hash function uses to map to an index of the array? Otherwise, how would it know what "key" to hash?
This also leads to the problem of the hash function itself. I've read that, depending on the dataset you're given, some hash functions are better than others. So if the hash function depends on the dataset, how could you possibly create a hash table data structure that can store any type of object?
So am I just overthinking this? Should I just learn to create a simple hash table that hashes integers when practicing for my interviews? And are hash tables in real life created generically, or do people usually come up with a different hash table depending on the type of data they have?
If this question is better suited for the Computer Science theory stack exchange, please let me know. I am just finding these little details are keeping me from truly understanding this data structure.

You need to separate the hash table from the hash function; they are separate concerns.
There are two common practices to keep your hash table generic and still be able to properly hash objects.
The first is to assume your template type (call it T) implements a hash method, and use it. You don't care how it is implemented, as long as it's there.
The other option is to accept, in addition to the template type, a hash function hash(T) that must be provided when declaring the hash table.
This basically solves both problems: the user, who knows the data distribution better than the library writer, supplies the hash function, and the supplied hash function works on the supplied type, regardless of what the "key" is.
If you choose the second option, you can implement default hash functions for the known and primitive types, so users won't need to reinvent the wheel every time they use the library with standard types.
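Here is a minimal C++ sketch of the second option (all names are illustrative; separate chaining for collisions). The hash function is a template parameter defaulting to std::hash, which already covers the primitive and standard types; this mirrors how std::unordered_map is parameterized in the standard library:

#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Minimal separate-chaining hash table. Hash is a template parameter,
// defaulting to std::hash<Key>, so callers can supply a function suited
// to their own data distribution.
template <typename Key, typename Value, typename Hash = std::hash<Key>>
class HashTable {
public:
    explicit HashTable(std::size_t bucket_count = 16, Hash hash = Hash())
        : buckets_(bucket_count), hash_(hash) {}

    void insert(const Key& key, const Value& value) {
        auto& bucket = buckets_[index(key)];
        for (auto& kv : bucket)
            if (kv.first == key) { kv.second = value; return; }  // overwrite
        bucket.emplace_back(key, value);
    }

    Value* find(const Key& key) {
        for (auto& kv : buckets_[index(key)])
            if (kv.first == key) return &kv.second;
        return nullptr;  // not found
    }

private:
    std::size_t index(const Key& key) const { return hash_(key) % buckets_.size(); }

    std::vector<std::list<std::pair<Key, Value>>> buckets_;
    Hash hash_;
};

// Usage: the stored type needs no attribute named "key"; the caller
// decides which attribute to use as the key at the call site.
struct Book { std::string isbn; std::string title; };

int main() {
    HashTable<std::string, Book> books;  // defaults to std::hash<std::string>
    Book b{"978-0131103627", "The C Programming Language"};
    books.insert(b.isbn, b);             // the caller picks the key attribute
    return books.find("978-0131103627") ? 0 : 1;
}

Note that neither Book nor any other stored type needs an attribute called "key": the caller chooses what to hash when inserting, and the default hash covers the standard types, exactly as described above.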

Related

What would be the most appropriate data structure given these requirements?

We are building a search API at our company for some of our entities (events, leagues, and sports), each of which has a name property, and we are having difficulty implementing the business requirements.
TL;DR: What data structure would address these business requirements better than a basic red-black tree does?
What are the business requirements?
1. The data structure needs to be sorted, so the following requirements are easier to implement; insertion therefore must not break this property.
2. The data structure needs to hold information about its entities: the node key (the entity's name property) is used for searching, but the node needs to hold all entities whose name property starts with the node's key value.
3. It needs to support deletion by id. Id is also a property of all entities.
4. It needs to support index search (up to 3 characters), so if someone searches for "aaa", every node with a key between "aaaa.." and "aaaz" should appear (e.g. query = "aaa", index = "aaa", "aaab", "aaaab", "aaaz"; the result should be "aaa", "aaab", "aaaab").
5. We need to search by localized node key.
What have we done so far?
In our first iteration we used the built-in red-black tree (SortedSet in C#); for the nodes we had a structure holding the entity's name property and all events related to that name. With one helper method we satisfied business requirements (1), (2), and (4).
In our second iteration we had to support deletion, so we created a map (Dictionary) from entity ids to references to the entity objects stored in the SortedSet. We did this because deletion requests come only by id, and we cannot recreate an entity from its id, so we have to build this map at insertion time. (Maybe augmentation can help?) With this we satisfied requirement (3).
Now we need to support (5); however, with every iteration (each business requirement we receive) implementation is getting harder and harder, and I almost feel we need to change our data structure to address the business criteria better.
What's the problem with localization?
We can create a new SortedSet and re-use the implementation, but this comes with a huge trade-off. Let me elaborate.
We have around 100 clients, each with 7-8 supported languages. Languages in our system are unique per client, so translations for one customer do not interfere with another's (if someone wants to call it Soccer rather than Football, fine, let it be). Besides that, we have base languages (global for every client) which are essentially the default settings for newly created languages, so we can safely say that a very large portion of a client-specific language (say, English) is the same as the base one. Having said all of that, if we want accurate search for each client and locale individually, we need an index per client and locale, which in turn introduces massive amounts of duplication.
What have I thought of so far?
I am not an expert in data structures myself, but I really want to get this right. Of course everything is possible with enough coding and hardware, but that's not the point.
I thought about implementing some binary tree (AVL, red-black, 2-3-4, etc.) and augmenting it to meet the requirements better than the built-in SortedSet does. This would hopefully resolve a lot of the issues and workarounds we have had to make so far and, as I said, address future requirements better, so implementation is faster and more accurate. However, as I said, I am not an expert in data structures, and sadly I am unable to map these business requirements to a data structure in the time frame I have. So, without further ado, do you have any suggestions?
My suggestion here would be for your primary data structure to be a dictionary, keyed by product id, with the product data as the value. That gives you very quick insertion, and removal by product id.
For searching, provide a separate data structure that contains the product names and associated product ids.
class IndexEntry
{
    string ProductName;
    string ProductId; // or int, if ProductId is an integer
}
Since you allow customer-specific names, you'll have to add all those customer names to this index. Not a problem, but when you remove something by ID, you'll also have to remove the associated items from the other data structure. This will require a sequential search of the name index data structure to ensure that you get all the names associated with a particular product. That could be expensive, even if you use a tree structure.
To speed things up, you could have a "deleted" flag for those index entries, and then rebuild the structure periodically to remove the deleted items. That way, a deletion just requires a sequential scan. That's less than ideal, but if insertions and deletions are infrequent, quite acceptable.
The key, though, is to make your primary data structure that holds the product information indexed by product id. You can then build secondary indexes any way you want.
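As a rough sketch of this two-structure design (illustrative names, and in C++ rather than the C# above): a hash map keyed by id serves as the primary store, and a sorted multimap from name to id serves as the secondary index, which also makes the prefix search in requirement (4) a simple range scan:

#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical entity; field names are illustrative.
struct Product {
    std::string id;
    std::string name;
};

int main() {
    // Primary store: product id -> product data (fast insert/remove by id).
    std::unordered_map<std::string, Product> byId;
    // Secondary index: sorted by name, mapping each (possibly localized)
    // name to the owning product id. A multimap allows duplicate names.
    std::multimap<std::string, std::string> nameIndex;

    Product p{"42", "aaab"};
    byId[p.id] = p;
    nameIndex.emplace(p.name, p.id);

    // Prefix search: walk the sorted index from the first key >= "aaa"
    // while the key still starts with "aaa".
    const std::string prefix = "aaa";
    for (auto it = nameIndex.lower_bound(prefix);
         it != nameIndex.end() &&
         it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        std::cout << it->first << " -> " << it->second << '\n';
    }
}

Deletion by id goes through the primary map first; cleaning up the matching nameIndex entries is the sequential pass (or deleted-flag scheme) described above.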

HBase Schema Nested Entity

Does anyone have an example of how to create an HBase table with a nested entity?
Example
UserName (string)
SSN (string)
+ Books (collection)
The books collection would look like this for example
Books
isbn
title
etc...
I cannot find a single example of how to create a table like this. I see many people talk about it, and about how it is a best practice in certain scenarios, but I cannot find an example of how to do it anywhere.
Thanks...
Nested entities aren't an official feature of HBase; it's just a way some people describe one usage pattern. In this pattern, you use the fact that "columns" in HBase are really just a big map (a bunch of key/value pairs) to model a one-to-many relationship inside the row, adding one column per "row" of the nested entity.
Schema-wise, you don't need to do much on the table itself; when you create a table in HBase, you just specify the name and column family (and associated properties), like so (in the hbase shell):
hbase:001:0> create 'UsersWithBooks', 'cf1'
Then it's up to you what you put in it, column-wise. You could insert values like:
hbase:002:0> put 'UsersWithBooks', 'userid1234', 'cf1:username', 'my username'
hbase:003:0> put 'UsersWithBooks', 'userid1234', 'cf1:ssn', 'my ssn'
hbase:004:0> put 'UsersWithBooks', 'userid1234', 'cf1:book_id_12345', '<isbn>12345</isbn><title>mary had a little lamb</title>'
hbase:005:0> put 'UsersWithBooks', 'userid1234', 'cf1:book_id_67890', '<isbn>67890</isbn><title>the importance of being earnest</title>'
The column names are totally up to you, and there's no limit to how many you can have (within reason: see the HBase Reference Guide for more on this). Of course, doing this, you have to do your own legwork for putting in and getting out values (and you'd probably do it with the Java client in a more sophisticated way than these shell commands, which are just for explanatory purposes). And while you can efficiently scan just a portion of the columns in a table by key (using a column pagination filter), you can't do much with the contents of the cells other than pull them out and parse them elsewhere.
Why would you do this? Probably only if you wanted atomicity across all the nested rows for one parent row. It's not very common; your best bet is probably to start by modeling them as separate tables, and only move to this approach if you really understand the tradeoffs.
There are some limitations to this. First, the technique only works one level deep: your nested entities can't themselves have nested entities. You can still have multiple different nested child entities in a single parent, with the column qualifier serving as their identifying attribute. Second, it's not as efficient to access an individual value stored as a nested column qualifier inside a row as it is to access a row in another table.
Still, there are compelling cases where this kind of schema design is appropriate. If the only way you access the child entities is via the parent entity, and you'd like transactional protection around all children of a parent, this can be the right way to go.

MVC Design: How many array controllers do I need?

I have a pretty straightforward MVC design question.
I've got a class with a bunch of properties, and a form to present an instance of the class. Several of the class properties are arrays - some are arrays of NSStrings to be presented in a one-dimensional table view, and some are arrays of sub-objects to be presented in a two-dimensional table view (one column per sub-object property). I don't actually want to do anything with the data in any of these tables - just present the contents in a scrollable, read-only table view.
During my first attempt at bindings, I added an object controller bound to the class instance. Then, I tried to bind each column of each table view to the "selection" member of the class, with a model key path specifying the array property of the instance (and, for the two-dimensional tables, a member of the sub-object). I was surprised that this didn't work for the columns of the one-dimensional tables.
Next, I added one array controller for every table, binding it to the "selection" member of the object controller. For the one-dimensional tables, I bound the column to the array controller with no model key path; for the two-dimensional tables, I bound the column to the array controller with a model key path specifying a property of the sub-object. This works - but for a window with seven tables, I have seven array controllers! That feels like overkill, since the tables aren't doing anything other than presenting data.
My question is simple: Is my design in line with good MVC practice - do I really need all of these array controllers? Or is there a simpler way to specify my bindings (for one-dimensional and/or two-dimensional tables) that will enable me to eliminate some array controllers? When I have an array of strings in an object to be displayed in a one-column table, it feels like overkill to use an array controller bound to the object and the table.
As an ancillary question - do I really need to worry about excessive array controllers? Are they lightweight objects that I should use liberally, or resource-intensive objects that I should conserve, especially in limited resource contexts like iOS?

would you use an array or a custom-made class for simple data manipulation? (ruby)

I can do a bit of coding in Ruby. I've only just touched objects and I am not very object-literate; I mean, I do not think in objects yet :-)
I have data that I scrape from a forum on a regular basis. I need fields like
author, date posted, title, category, number of views, etc. = an array, from my point of view.
Then I want to be able to do the following in Ruby:
save the whole lot (quick solution is csv or xml - later probably some sql database)
sort it by field
load/read my file to update fields and do some statistics, extract some data
add new fields easily in case I need to
edit, modify my "file/database" outside ruby.
I believe that I can do every operation (change the number of views of a post, change the date of the last reply in the post, etc.) using either an array or an object.
So my question is: would you use a custom class/object or an array?
Could you tell me why?
It would seem logical to me, at least, to make an object for storing and working with the data that you're scraping. Typically, you'd have instance variables for each of the fields that you have mentioned (author, title, category, views, date_posted) and probably some methods to populate them from the scraped data as well as read/write them.
In terms of storing the data for these objects, using an ORM such as ActiveRecord or DataMapper makes this very easy. An ORM lets you map the data in a data store, such as MySQL, to the corresponding Ruby objects. It will also provide a bunch of convenience methods for saving, updating and querying those objects.
However, it might be a good learning experience to try writing your own methods to map the data to XML files.
Do you mean "would you use an array or a custom-made class" to process this data?
What I would probably do is create a class that stores the data you want internally as an array or hash. You would then have methods of that class you could call to perform the tasks that you describe.
An object encapsulates data with behaviour, i.e. functions or operations that can be performed on the data, whereas an array is just a data structure that holds a collection of elements. Basically, plain data structures expose data and have no meaningful functions.
Since you want to perform save, sort, update, statistics, etc. operations on your collected data, it makes sense to have a Post object with attributes (author, date posted, title, category, etc.) and the operations/methods you would like to perform on that data. Abstracting the data and behaviour of your object into a class will make your code easier to maintain and understand: you can see the responsibility of the class from the methods defined in it, and how those methods change the state of the object by manipulating its attributes.
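As a concrete sketch of that idea (shown in C++ to keep one language across the examples in this collection; a Ruby version would be a small class with attr_accessors and the same methods), the class bundles the attributes with the behaviour that manipulates them:

#include <algorithm>
#include <string>
#include <vector>

// Illustrative Post class: attributes (data) together with the
// behaviour that manipulates them.
class Post {
public:
    std::string author;
    std::string title;
    std::string category;
    int views = 0;

    void recordView() { ++views; }  // updates the object's own state
};

// Operations such as "sort by field" then act on collections of posts.
void sortByViews(std::vector<Post>& posts) {
    std::sort(posts.begin(), posts.end(),
              [](const Post& a, const Post& b) { return a.views > b.views; });
}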

How would you represent a relational entity as a single unit of retrievable data in BerkeleyDB?

BerkeleyDB is the database equivalent of a Ruby hash or a Python dictionary, except that you can store multiple values for a single key.
My question is: If you wanted to store a complex datatype in a storage structure like this, how could you go about it?
In a normal relational table, if you want to represent a Person, you create a table with columns of particular data types:
Person
-id:integer
-name:string
-age:integer
-gender:string
When it's written out like this, you can see how a person might be understood as a set of key/value pairs:
id=1
name="john";
age=18;
gender="male";
Decomposing the person into individual key/value pairs (name="john") is easy.
But in order to use the BerkeleyDB format to represent a Person, you would need some way of recomposing the person from its constituent key/value pairs.
For that, you would need to impose some artificial encapsulating structure to hold a Person together as a unit.
Is there a way to do this?
EDIT: As Robert Harvey's answer indicates, there is an entity persistence feature in the Java edition of BerkeleyDB. Unfortunately, because I will be connecting to BerkeleyDB from a Ruby application using Moneta, I will be using the standard edition, which I believe requires me to create a custom solution in the absence of this support.
You can always serialize (called marshalling in Ruby) the data as a string and store that instead. The serialization can be done in several ways.
With YAML (advantage: human readable, multiple implementation in different languages):
require 'yaml'; str = person.to_yaml
With Marshal (Ruby-only, even Ruby-version-specific):
Marshal.dump(person)
This will only work if the person's class is an entity that does not refer to other objects you don't want included. For example, references to other persons would need to be handled differently.
If your datastore is able to do so (and BerkeleyDB is, AFAICT), I'd just store a representation of the object's attributes keyed by the object id, without splitting the attributes across different keys.
E.g. given:
Person
-id:1
-name:"john"
-age:18
-gender:"male"
I'd store the YAML representation in BerkeleyDB under the key person_1:
--- !ruby/object:Person
attributes:
  id: 1
  name: john
  age: 18
  gender: male
Instead, if you need to store each attribute as a separate key in the datastore (why?), you should make sure the key for each person record is linked to its identifying attribute, which is the id for an ActiveRecord model.
In this case you'd store these keys in BerkeleyDB:
person_1_name="john";
person_1_age=18;
person_1_gender="male";
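For illustration, here is the whole-blob approach sketched against Berkeley DB's C++ API (assuming the standard db_cxx bindings; through Moneta in Ruby the store/fetch pattern is the same): one key per person, one serialized value.

#include <db_cxx.h>
#include <string>

int main() {
    // Open (or create) a B-tree database file; no environment, for brevity.
    Db db(nullptr, 0);
    db.open(nullptr, "people.db", nullptr, DB_BTREE, DB_CREATE, 0);

    // One key per entity; the value is the whole serialized object
    // (YAML here, but any byte string works).
    std::string key = "person_1";
    std::string value =
        "--- !ruby/object:Person\n"
        "attributes:\n"
        "  id: 1\n"
        "  name: john\n"
        "  age: 18\n"
        "  gender: male\n";

    Dbt k((void*)key.data(), key.size());
    Dbt v((void*)value.data(), value.size());
    db.put(nullptr, &k, &v, 0);

    // Reading it back returns the blob, ready to be deserialized
    // by whatever produced it.
    Dbt out;
    db.get(nullptr, &k, &out, 0);
    db.close(0);
    return 0;
}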
Have a look at this documentation for an Annotation Type Entity:
http://www.oracle.com/technology/documentation/berkeley-db/je/java/com/sleepycat/persist/model/Entity.html