I'm trying to compare the lookup time complexity of a B-tree and a hash table.
A B-tree needs log_b(n) operations per lookup, and log_b(n) <= b whenever n <= b^b, so for b = 10 that covers up to 10^10 keys and a lookup takes at most 10 operations.
A hash table needs 1 operation per lookup on average. But if I have 10^10 keys and my hash table has 10^10 / 10 buckets, then a lookup takes about 10 operations on average (with separate chaining), doesn't it?
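To make my numbers concrete (assuming base-10 branching, a uniform hash function, and separate chaining): the B-tree needs log_10(10^10) = 10 node visits per lookup, while the hash table's expected chain length is n / m = 10^10 / (10^10 / 10) = 10, so both come out to roughly 10 operations per lookup under these assumptions.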
I know this is all rather theoretical. I want to know: what is better in practice, and why?
what is better in practice?
It depends.
A B-tree always gives O(log n) lookup performance.
A hash table gives O(1) lookups (much better than the B-tree), provided you have:
a good hash function for your data, and
enough hash buckets.
If those criteria are not met, the hash table tends towards O(n), i.e. much worse than the B-tree.
Summary: with a good hash function, the hash table will usually be better. A B-tree is consistent without needing a hash function.
In practice n is rarely large, and even a generic hash will get you close enough to O(1) that spending time on the question is a pointless optimisation.
Real answer: until you measure performance and determine that data-structure lookup times are significant, put your optimisation effort where your users will see a noticeable difference.
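If you do decide to measure, a minimal benchmark sketch along these lines is a reasonable starting point. Here std::map stands in for an ordered tree (it is a red-black tree, not a true B-tree, so treat it as a rough proxy) and std::unordered_map for the hash table; the key count and distribution are placeholders you would replace with your real data:

```cpp
#include <chrono>
#include <iostream>
#include <map>
#include <random>
#include <unordered_map>
#include <vector>

// Times a batch of lookups against a container pre-filled with the same keys.
template <typename Container>
long long TimeLookups(const Container& c, const std::vector<int>& keys) {
    auto start = std::chrono::steady_clock::now();
    long long found = 0;
    for (int k : keys) found += c.count(k);  // summing keeps the loop from being optimised away
    auto stop = std::chrono::steady_clock::now();
    std::cout << "found " << found << " keys, ";
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}

int main() {
    constexpr int N = 1'000'000;  // placeholder: use your real data volume
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 10 * N);

    std::vector<int> keys(N);
    for (int& k : keys) k = dist(rng);

    std::map<int, int> tree;            // ordered tree (red-black, a stand-in for the B-tree)
    std::unordered_map<int, int> hash;  // hash table
    for (int k : keys) { tree[k] = k; hash[k] = k; }

    std::cout << "tree: " << TimeLookups(tree, keys) << " ms\n";
    std::cout << "hash: " << TimeLookups(hash, keys) << " ms\n";
}
```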
You cannot easily compare them because they provide different functionality. The hash table is a pure key-value store, while the tree also allows lookups based on order (previous/next element, range queries, etc.).
Rule of thumb: if you want to use them for a specific task, just measure which one is better.
Note: those numbers are huge; do they even fit into the memory of your machine?
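To illustrate that functional difference, here is a small sketch of the order-based operations a tree gives you and a plain hash table does not (std::map and std::unordered_map standing in for the two structures):

```cpp
#include <iostream>
#include <iterator>
#include <map>
#include <string>

int main() {
    std::map<int, std::string> tree = {{1, "a"}, {5, "b"}, {9, "c"}};

    // Order-based lookup: the first key not less than 4 (here: 5 -> "b").
    auto it = tree.lower_bound(4);
    std::cout << it->first << " -> " << it->second << '\n';

    // Previous/next entries relative to that position.
    std::cout << std::prev(it)->first << ' ' << std::next(it)->first << '\n';  // 1 9

    // In-order traversal; a hash table (std::unordered_map) offers no such ordering guarantee.
    for (const auto& [key, value] : tree) std::cout << key << ' ';             // 1 5 9
}
```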
To all HE (homomorphic encryption) experts out there:
I want to implement a matrix-vector multiplication with very large matrices (600000 x 55). Currently I am able to perform HE operations like addition, multiplication, inner product, etc. with small inputs. When I try to apply these operations to larger inputs, I get errors like invalid next size (normal), or I run out of main memory until the OS kills the process (exit code 9).
Do you have any recommendations or examples of how to achieve an efficient matrix-vector multiplication or something similar (using BFV and CKKS)?
PS: I am using the PALISADE library, but if you have better suggestions, such as SEAL or HElib, I would happily use them as well.
CKKS, which is also available in PALISADE, would be a much better option for your scenario as it supports approximate (floating-point-like) arithmetic and does not require high precision (large plaintext modulus). BFV performs all operations exactly (mod plaintext modulus). You would have to use a really large plaintext modulus to make sure your result does not wrap around the plaintext modulus. This gets much worse as you increase the depth, e.g., two chained multiplications.
For matrix-vector multiplication, you could use the techniques described in https://eprint.iacr.org/2019/223, https://eprint.iacr.org/2018/254, and the supplemental information of https://eprint.iacr.org/2020/563. The main idea is to choose the right encoding and take advantage of SIMD packing. You would work with a power-of-two vector size and could pack the matrix either as 64xY (multiple rows) per ciphertext or a part of each row per ciphertext, depending on which one is more efficient.
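For intuition only, here is a plaintext C++ sketch of the rotate-multiply-accumulate ("diagonal") idea those packing techniques build on; it is not PALISADE code, the function name is made up, and it is written for a small square matrix whose dimension equals the slot count. The rectangular 600000 x 55 case would be split into such blocks as described in the papers above.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Plaintext simulation of the diagonal method: result = sum_k diag_k(M) * rot_k(v),
// where diag_k(M)[i] = M[i][(i + k) mod n] and rot_k(v)[i] = v[(i + k) mod n].
// In the encrypted setting each rot_k becomes a ciphertext rotation and the
// multiply/add become SIMD slot-wise operations, so the cost is n rotations
// and multiplications instead of n^2 scalar operations.
std::vector<double> DiagonalMatVec(const std::vector<std::vector<double>>& M,
                                   const std::vector<double>& v) {
    const std::size_t n = v.size();
    std::vector<double> result(n, 0.0);
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            result[i] += M[i][(i + k) % n] * v[(i + k) % n];
        }
    }
    return result;  // result[i] = sum_j M[i][j] * v[j]
}

int main() {
    std::vector<std::vector<double>> M = {{1, 2}, {3, 4}};
    std::vector<double> v = {5, 6};
    for (double x : DiagonalMatVec(M, v)) std::cout << x << ' ';  // 17 39
}
```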
I'm looking to create a library where one of the things users need to do is store and retrieve data, along with an index. I don't know which they'll be doing more of: inserting, reading/writing, deleting, or random searching.
What kind of data structure would you use so they get the best performance in general? How would your proposed data structure compare, performance-wise, in each scenario?
I'm thinking a hash table or an AVL tree? Or some combination of data structures, like a linked list of arrays?
What would be cool is if it self-optimized: if it sees the user doing more inserts, reads, or random searches, future operations are tuned for that workload.
There's no single best data structure out there that does this or I'd promise that everyone would be using it. However, there are a couple of very reasonable options available.
The first question to think about is what do you need to do with the data? If you're just storing items and looking them up later on, and all you need to do is add, remove, and look up items, then you might want to look more toward various flavors of hash tables. On the other hand, if you're looking for the ability to process items in sorted order, then hash tables are probably out and you should probably look more toward balanced trees.
The next question is what type of data you're storing. If each item has some associated key, what kind of key is it? Both hash tables and BSTs are great in general, but more specialized data structures exist as well that work specifically for string keys (tries) and other types like integers.
From there you should think about how much data you're storing. If you're storing a couple hundred megabytes and things fit comfortably in RAM, you might not need to do anything special here. But if you have a truly huge amount of data and things don't fit into RAM, you'll need to look into external data structures like B-trees.
Another question to consider is what kind of performance guarantees you want. Most hash tables require some sort of dynamic resizing as the number of items increases, which can lead to infrequent but expensive rebuild operations that can slow things down. If you absolutely need real-time performance, this won't work for you. If you're okay with that, then go for it!
And let's suppose you've then narrowed things down to, say, "a hash table" or "a balanced BST." Now you have to select which type to use! For hash tables, simple structures like linear probing hash tables or chained hashing often need some performance tuning to be maximally efficient. Newer approaches like cuckoo hashing can give better memory performance in some cases, while engineered approaches like Google's flat_hash_map are extremely optimized for the x86 architecture. For BSTs, you might want something like an AVL tree if you have way more lookups than insertions or deletions, since AVL trees have a low height, but you might also want to look at red/black trees if insertions and deletions are more common, and perhaps into more modern trees like RAVL or WAVL trees if you really have a lot of deletions.
All of this is to say that the answer is "it depends." The more you know about your particular application, the better a data structure you'll be able to pick. And, sadly, there is no One Data Structure To Rule Them All. :-)
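To make the "it depends" concrete, here is a minimal sketch of one way a library could keep the choice open: hide the container behind a small interface and swap backends (or even pick one at runtime based on the observed workload). The names KeyValueStore, HashStore, and TreeStore are made up for illustration, with std::unordered_map and std::map standing in for "a hash table" and "a balanced BST":

```cpp
#include <map>
#include <optional>
#include <unordered_map>

// Common interface the library exposes; callers never see the backing structure.
template <typename K, typename V>
class KeyValueStore {
public:
    virtual ~KeyValueStore() = default;
    virtual void Put(const K& key, const V& value) = 0;
    virtual std::optional<V> Get(const K& key) const = 0;
    virtual void Erase(const K& key) = 0;
};

// Backend tuned for point lookups: expected O(1) per operation, no ordering.
template <typename K, typename V>
class HashStore : public KeyValueStore<K, V> {
    std::unordered_map<K, V> data_;
public:
    void Put(const K& key, const V& value) override { data_[key] = value; }
    std::optional<V> Get(const K& key) const override {
        auto it = data_.find(key);
        if (it == data_.end()) return std::nullopt;
        return it->second;
    }
    void Erase(const K& key) override { data_.erase(key); }
};

// Backend that also keeps keys sorted (range queries, predecessor/successor) at O(log n).
template <typename K, typename V>
class TreeStore : public KeyValueStore<K, V> {
    std::map<K, V> data_;
public:
    void Put(const K& key, const V& value) override { data_[key] = value; }
    std::optional<V> Get(const K& key) const override {
        auto it = data_.find(key);
        if (it == data_.end()) return std::nullopt;
        return it->second;
    }
    void Erase(const K& key) override { data_.erase(key); }
};
```

A "self-optimising" version could simply count operations per type behind this interface and migrate the data to the other backend when the workload clearly favours it, at the price of a one-off O(n) rebuild.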
Let's say I'm creating a hash table with between 7 and 8 million elements, using linear probing to handle collisions. How do I figure out how many buckets are required?
There is no perfect answer. The number of buckets affects both memory usage and performance, and the more collision-prone the specific elements are (in combination with your hash function and table size; e.g. a prime number of buckets tends to be more tolerant than a power of 2), the more buckets you may want.
So, the best way, if you need accurate tuning, is to get realistic data and try a range of load factors (i.e. the ratio of elements to buckets), seeing where the memory/performance trade-off suits you best.
If you just want a generally useful load factor as a point of departure, try 0.7 to 0.8 if you have a halfway decent hash function. In other words, an often-sane ballpark figure for the number of buckets would be 8 million / 0.8 to 8 million / 0.7, i.e. roughly 10 to 11.4 million.
If you're serious about tuning this aggressively, and don't have another good reason for sticking with linear probing (e.g. supporting element deletions via immediate compaction rather than "tombstones" marking once-used buckets, over which lookups and deletions must skip and continue probing), you should move off linear probing, as it will give you far more collisions than almost any alternative.
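If it helps, here is the bucket-count arithmetic above as a tiny sketch (the 0.7 to 0.8 targets are just the ballpark figures mentioned, not universal constants):

```cpp
#include <cstddef>
#include <iostream>

// Turns a target load factor into a bucket-count estimate.
std::size_t EstimateBuckets(std::size_t num_elements, double target_load_factor) {
    return static_cast<std::size_t>(num_elements / target_load_factor) + 1;
}

int main() {
    std::cout << EstimateBuckets(8'000'000, 0.8) << '\n';  // ~10.0 million buckets
    std::cout << EstimateBuckets(8'000'000, 0.7) << '\n';  // ~11.4 million buckets
}
```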
I found the below in a question bank and I'm looking for some help with it.
For each of the following situations, select the best data structure and justify your selection.
The data structures should be selected from the following possibilities: unordered list, ordered array, heap, hash table, binary search tree.
(a) (4 points) Your structure needs to store a potentially very large number of records, with the data being added as it arrives. You need to be able to retrieve a record by its primary key, and these keys are random with respect to the order in which the data arrives. Records also may be deleted at random times, and all modifications to the data need to be completed just after they are submitted by the users. You have no idea how large the dataset could be, but the data structure implementation needs to be ready in a few weeks. While you are designing the program, the actual programming is going to be done by a co-op student.
For the answer, I thought a BST would be the best choice.
Since the eventual size is not known, a hash table is not a good choice.
Since deletion is required, a heap is not acceptable either.
Is my reasoning correct?
(b) (4 points) You are managing data for the inventory of a large warehouse store. New items (with new product keys) are added and deleted from the inventory system every week, but this is done while stores are closed for 12 consecutive hours.
Quantities of items are changed frequently: incremented as they are stocked, and decremented as they are sold. Stocking and selling items requires the item to be retrieved from the system using its product key.
It is also important that the system be robust, well-tested, and have predictable behaviour. Delays in retrieving an item are not acceptable, since it could cause problems for the sales staff. The system will potentially be used for a long time, though largely it is only the front end that is likely to be modified.
For this part I thought of heapsort, but I have no idea how to justify my answer.
Could you please help me?
(a) needs fast insertion and deletion, and you need retrieval by key. Thus I'd go with a hash table or a binary search tree. However, since the size is not known in advance and there's that deadline constraint, I'd say the binary search tree is the better alternative.
(b) You have plenty of time to reorganise the data after insertions/deletions, but retrieval by product key must be fast and predictable. An ordered array with binary search (O(log n) lookups, with no rehashing or rebalancing surprises) should do the trick.
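For (b), a hypothetical sketch of what that looks like in code: the inventory is kept as an array sorted by product key (rebuilt during the nightly 12-hour closure), and daytime retrieval is a predictable O(log n) binary search. All names here are illustrative, not from the question.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Item {
    int product_key;
    std::string name;
    int quantity;
};

// Binary search by product key over an array kept sorted by product_key.
Item* FindItem(std::vector<Item>& inventory, int key) {
    auto it = std::lower_bound(
        inventory.begin(), inventory.end(), key,
        [](const Item& item, int k) { return item.product_key < k; });
    if (it != inventory.end() && it->product_key == key) return &*it;
    return nullptr;  // not found
}

int main() {
    std::vector<Item> inventory = {{101, "bolt", 40}, {205, "nut", 12}};  // sorted by key
    if (Item* item = FindItem(inventory, 205)) --item->quantity;          // sell one unit
    std::cout << inventory[1].quantity << '\n';                           // 11
}
```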
Cross posted: Need a good overview for Succinct Data Structure algorithms
Ever since I learned about succinct data structures, I have been in desperate need of a good overview of the most recent developments in that area.
I have googled and read many of the articles near the top of the search results for the queries that came to mind, but I still suspect I have missed something important here.
Here are the topics of particular interest to me:
Succinct encoding of binary trees with efficient operations for getting the parent, the left/right child, and the number of elements in a subtree.
The main question here is the following: all approaches I know of assume the tree nodes are enumerated in breadth-first order (as in the pioneering work in this area, Jacobson, G. J. (1988), Succinct static data structures), which does not seem appropriate for my task. I deal with huge binary trees given in a depth-first layout, and the depth-first node indices are keys to other node properties, so changing the tree layout has a cost for me that I would like to minimize. Hence my interest in references to works that consider tree layouts other than breadth-first.
Large arrays of variable-length items in external memory. The arrays are immutable: I don't need to add/delete/edit items. The only requirements are O(1) element access time and as little overhead as possible, better than the straightforward offset-and-size approach (a rough sketch of that baseline follows the statistics below). Here are some statistics I gathered about typical data for my task:
typical number of items: hundreds of millions, up to tens of billions;
about 30% of items have a length of no more than 1 bit;
40%-60% of items have a length of less than 8 bits;
only a few percent of items have a length between 32 and 255 bits (255 bits is the limit);
the average item length is ~4 bits, +/- 1 bit;
any other distribution of item lengths is theoretically possible, but all practically interesting cases have statistics close to those described above.
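For concreteness, here is a rough sketch of the straightforward offset-based baseline I would like to beat (class and method names are just for illustration): items are concatenated into one bit buffer and a directory stores each item's starting bit offset, with lengths recovered as offset differences. With ~4-bit average items, the 64-bit per-item offsets dominate, which is exactly the overhead I want to avoid.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Baseline: concatenated bit buffer plus one absolute bit offset per item.
class PackedBitItems {
public:
    // Appends the low length_bits bits of value as the next item.
    void Append(std::uint64_t value, unsigned length_bits) {
        for (unsigned b = 0; b < length_bits; ++b) PushBit((value >> b) & 1u);
        offsets_.push_back(bit_count_);  // end offset of the item just written
    }

    // O(1) access: item i occupies bits [offsets_[i], offsets_[i+1]).
    // (Items are at most 255 bits; this sketch returns only the low 64.)
    std::uint64_t Get(std::size_t i) const {
        std::uint64_t out = 0;
        const std::uint64_t begin = offsets_[i], end = offsets_[i + 1];
        for (std::uint64_t b = begin; b < end && b - begin < 64; ++b)
            out |= std::uint64_t(GetBit(b)) << (b - begin);
        return out;
    }

private:
    void PushBit(bool bit) {
        if (bit_count_ % 64 == 0) bits_.push_back(0);
        if (bit) bits_.back() |= std::uint64_t(1) << (bit_count_ % 64);
        ++bit_count_;
    }
    bool GetBit(std::uint64_t pos) const { return (bits_[pos / 64] >> (pos % 64)) & 1u; }

    std::vector<std::uint64_t> bits_;        // concatenated item bits
    std::uint64_t bit_count_ = 0;            // total bits written
    std::vector<std::uint64_t> offsets_{0};  // item start offsets, plus a leading sentinel
};
```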
Links to articles of any complexity, tutorials of any obscurity, and more or less documented C/C++ libraries: anything that was useful to you in similar tasks, or that looks like it might be by your educated guess, is gratefully appreciated.
Update: I forgot to add to question 1: the binary trees I am dealing with are immutable. I have no requirement to alter them; all I need is to traverse them in various ways, always moving from a node to its children or to its parent, such that the average cost of these operations is O(1).
Also, a typical tree has billions of nodes and should not be stored fully in RAM.
Update 2: just in case someone is interested, I got a couple of good links in https://cstheory.stackexchange.com/a/11265/9276.