Why even have hash tables? - data-structures

Hash tables allow mapping keys to values by using a hashing function. Here the hashing function actually computes the index of the key mapped to a specific value. But I just can't get my head around why we even use hash tables it in the first place? Why do you need a hash table? Are maps/dictionaries not good enough? Why not declare a dictionary ({'key1': 'value1'} in Python) and use it in the places where a hash table is required? I read a lot about it and still don't get it. Can you help me understand this?

why do you need a hashtable, is the map/dictionary not good
This is like asking why you need an automotive engine, isn't a car good enough? An engine is how a car works; you just don't see the engine when you are driving the car. But if you are learning to become an automotive engineer, then you should learn how engines work and how to design, build and maintain them.
Likewise, a hash table is how a dictionary works, you just don't see the hash table if you are writing code that uses a dictionary. But if you are learning to become a computer scientist, then you should learn how hash tables and other data structures work, and how to design, build and maintain them.

Related

Basic differences between HashTable and HashMap?

I am researching about hash tables and hash maps, everything I have read or watched gives a very vague description of the differences. From messing around on Netbeans with them both, they seem to have the same functions and do the same things, what are the fundamental differences between these two data structures?
There are no differences, but you can find that the same thing called differently in different programming languages, so how people call something depends on their background and programming language they use. For example: in c++ it will be HashMap and in java it will be HashTable.
Also, there could be one difference concluded based on the naming: HashTable allows only store hashed keys, but not values whereas HashMap allows to retrieve a value by hashed key. Internally the both will use the same algorithm and can be considered as same data structure.
HashTable sounds to me like a concrete data structure, although it has numerous variants depending on what happens when a collision occurs, when the table fills up, when it empties.
Map sounds like a abstract data structure, something defined by the available operations (Dictionary would be a potential other name for the same data structure, but I'd not be surprised if some nomenclature defined both with a nuance somewhere).
HashMap sounds like an implementation of the Map abstract data structure using an HashTable concrete data structure.
Again, I'd not be surprised if a language or a library provided both, with a nuance somewhere (HashMap for instance could provide only the operations defined for a Map, but HashTable provides everything which make sense for an HashTable).

logstash-filter-translate performance: how many keys would be too much keys in a ruby Hash?

logstash-filter-translate its a great filter to perform key/value lookups, but I am concerned about performance at scale and having many keys, let's say 50k keys stored on SSD, for what I understand reading the source code, a ruby Hash is pre-populated with the file content (in case of dictionary_path) and the lookup is made first ensuring key exists using a regex match and followed by the get it in case exists. I suppose this is mostly a Ruby question now but I wanted to ask for experiences.

Caching sortable/filterable data in Redis

I have a variety of data that I've got cached in a standard Redis hashmap, and I've run into a situation where I need to respond to client requests for ordering and filtering. Order rankings for name, average rating, and number of reviews can change regularly (multiple times a minute, possibly). Can anyone advise me on a proper strategy for attacking this problem? Consider the following example to help understand what I'm looking for:
Client makes an API request to /api/v1/cookbooks?orderBy=name&limit=20&offset=0
I should respond with the first 20 entries, ordered by name
Strategies I've considered thus far:
for each type of hashmap store (cookbooks, recipes, etc), creating a sorted set for each ordering scheme (alphabetical, average rating, etc) from a Postgres ORDER BY; then pulling out ZRANGE slices based on limit and offset
storing ordering data directly into the JSON string data for each key.
hitting postgres with an SELECT id FROM table ORDER BY _, and using the ids to pull directly from the hashmap store
Any additional thoughts or advice on how to best address this issue? Thanks in advance.
So, as mentioned in a comment below Sorted Sets are a great way to implement sorting and filtering functionality in cache. Take the following example as an idea of how one might solve the issue of needing to order objects in a hash:
Given a hash called "movies" with the scheme of bucket:objectId -> object, which is a JSON string representation (read about "bucketing" your hashes for performance here.
Create a sorted set called "movieRatings", where each member is an objectId from your "movies" hash, and its score is an average of all rating values (computed by the database). Just use a numerical representation of whatever you're trying to sort, and Redis gives you a lot of flexibility on how you can extract the slices you need.
This simple scheme has a lot of flexibility in what can be achieved - you simply ask your sorted set for a set of keys that fit your requirements, and look up those keys with HMGET from your "movies" hash. Two swift Redis calls, problem solved.
Rinse and repeat for whatever type of ordering you need, such as "number of reviews", "alphabetically", "actor count", etc. Filtering can also be done in this manner, but normal sets are probably quite sufficient for that purpose.
This depends on your needs. Each of your strategies could work.
Your first approach of storing an auxiliary sorted set for each way
you want to order is the best way to do this if you have a very big
hash and/or you run your order queries frequently. This approach will
require a lot of ram if your hash is big, but it will also scale well
in terms of time complexity as your hash gets bigger and you start
running order queries more frequently. On the other hand, it
introduces complexity in your data structures, and feels like you're
trying to use Redis for something a typical DB like Postgres, MySQL,
or Mongo would be better at.
Storing ordering data directly into your keys means you need to pull
your entire hash every time you do an order query. Maybe that's not
so bad if your hash is very small, or you don't do ordered queries very often, but this won't scale at all.
If you're already hitting Postgres to get keys, why not just store the values in Postgres as well. That would be much cheaper than hitting Postgres and then hitting Redis, and would have your code depend on fewer things. IMO, this is probably your best option and would work most naturally. Do this, unless you have some really good reason to not store values in Postgres, or some really big speed concerns, in which case go with your first strategy.

Hashes: Tables, Lists and Maps, Oh My?

I've been trying to find some concrete (laymen; non super-academic) definitions for the various types of hash data structures, specifically hash tables, hash lists and hash maps. Online searches provide many useful links to all of these, but never give clear definitions of when it is appropriate to use each over the others.
(1) From a practical standpoint, what's the difference between these 3?
(2) How do their operations' run times differ? Are there clear instances when one should be used or avoided over the other types of hashes?
(3) How do each of these relate back to the Map ADT? Are they all just different implementations of it, or different beasts altogether?
Thanks for any insight here!
There's an abstract data structure that contains mapping between keys and values. It has several different names, including Map, Dictionary, Table, Association Table, and more.
The most basic operations that should be supported by this data-structure are adding, removing and retrieving a value, given its associated key. There are variations and additions around this basic concept - for instance, some structures support iterating over all the key-value pairs, some structures support multiple values per key, etc. There's also a difference in time and space complexity between the various implementations.
Of the multiple implementations available for this data structure, some of the most popular ones utilize hash functions for fast access times. Those implementations are sometimes called by the name Hash Table or Hash Map, you can read more about them in Wikipedia. The performance also varies between hash table implementations, with some reaching amortized O(1) insertion and access complexity (for the price of a lot of space used).
A hash list, on the other hand, is a different thing, and is more about the usage of a data structure, than its actual structures. A hash list is usually just a regular list of hash values, nothing special about it. It's used when verifying the integrity of a large piece of data - in that case it allows various data chunks to be verified independently, allowing for fixing or retrieving of just the bad chunks. This is as opposed to using a single hash value to hash the entire piece of data, in which case a failure means all the data has to be fixed or retrieved again.

Algorithm to organize table into many tables to have less cells?

I'm not really trying to compress a database. This is more of a logical problem. Is there any algorithm that will take a data table with lots of columns and repeated data and find a way to organize it into many tables with ID's in such a way that in total there are as few cells as possible, and that this tables can be then joined with a query to replicate the original one.
I don't care about any particular database engine or language. I just want to see if there is a logical way of doing it. If you will post code, I like C# and SQL but you can use any.
I don't know of any automated algorithms but what you really need to do is heavily normalize your database. This means looking at your actual functional dependencies and breaking this off wherever it makes sense.
The problem with trying to do this in a computer program is that it isn't always clear if your current set of stored data represents all possible problem cases. You can't only look at numbers of values either. It makes little sense to break off booleans into their own table because they have only two values, for example, and this is only the tip of the iceberg.
I think that at this point, nothing is going to beat good ol' patient, hand-crafted normalization. This is something to do by hand. Any possible computer algorithm will either make a total mess of things or make you define the relationships such that you might as well do it all yourself.

Resources