Why is lookup in an Array O(1)? - ruby

I believe that in some languages other than Ruby, an Array lookup is O(1) because you know where the data starts, and you multiply the index by the size of the data the array is holding, and then access that memory location.
However, in Ruby, an Array can have objects from different classes, so how does it manage to do a lookup of O(1) complexity?

What @Neil Slater said, with a little more detail…
There are basically two plausible approaches to storing an array of heterogeneous objects of differing sizes:
1. Store the objects as a singly- or doubly-linked list, with the storage space for each individual object preceded by pointer(s) to the preceding and/or following objects. This structure makes it very easy to insert new objects at arbitrary points without shifting around the rest of the array, but the huge downside is that looking up an object by its position is generally O(N): you have to start from one end of the list and jump through it node by node until you arrive at the n-th one.
2. Store a table or array of constant-sized pointers to the individual objects. Since this lookup table contains constant-sized items in a contiguous ordered layout, looking up the addresses of individual objects is O(1); the table is just a C-style array, in which a lookup takes only one-to-a-few machine instructions, even on RISC CPU architectures.
(The allocation strategies for storing the individual objects are also interesting and complex, but not immediately relevant to your question.)
Dynamic languages like Perl/Python/Ruby pretty much all opt for #2 for their general-purpose list/array types. In other words, they make lookup more efficient than inserting objects at random locations in the list, which is the better choice for many applications.
I'm not familiar with the implementation details for Ruby, but they are likely quite similar to those of Python's list type, whose performance and design are explained in wonderful detail at effbot.org.
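As a rough, hedged illustration, here is a tiny Python sketch of approach #2 (Python rather than Ruby, since the effbot write-up above describes Python's list; Ruby's internal layout may differ in detail). The only point is that each slot in the table is a same-sized reference, so indexing cost is independent of what the elements are.

```python
# Sketch of approach #2: the container stores only fixed-size references;
# the heterogeneous objects themselves live elsewhere on the heap.
# (CPython's list is literally an array of object pointers, which is why
# indexing is O(1) regardless of element type.)

mixed = [42, "a string", 3.14159, {"key": "value"}, [1, 2, 3]]

# Indexing is "table base + index * reference_size", then one dereference,
# no matter how large or small the object in that slot happens to be.
print(mixed[3])   # {'key': 'value'} -- O(1)
print(mixed[1])   # 'a string'       -- O(1)
```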

Its implementation probably contains an array of memory addresses, pointing to the actual objects. Therefore it can still look up an element without looping through the array.

Related

Disadvantages of tries

I've been studying tries and checking out their advantages and disadvantages. They're quite useful in many practical applications like dictionaries, spell checkers, etc. due to their constant O(m) look-ups (where m is the length of the string) and other advantages like providing ordered retrieval of strings and finding common prefixes. So, the advantages are pretty clear to me, but the limitations are a bit confusing.
I'm following this link : https://en.wikipedia.org/wiki/Trie
Drawbacks listed here are:
Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.
Follow-up question: Why is there a scenario involving secondary storage? Aren't tries also supposed to be stored in main memory? If they're stored in secondary storage, then there's no point in using a trie anyway, as disk access will always take longer.
Some tries can require more space than a hash table, as memory may be allocated for each character in the search string, rather than a single chunk of memory for the whole entry, as in most hash tables.
Follow-up question: Is it due to the fact that tries would contain more references/pointers for connecting each character to the next one, and that would consume more bytes than if the whole string were stored in one piece? (I got this reason from one of the answers here.) Can anyone elaborate on this too?
I'd really appreciate some help here. Thanks.
First, "constant O(m) look-ups" is meaningless. Lookup time in a trie is O(m): it depends on the length of the string you're looking up.
A well constructed hash table (i.e. a good hash function and a reasonable load factor) has O(1) lookup time.
Assuming competent construction, looking up a string in a hash table will be much faster than looking it up in a trie.
Tries and hash tables are used for different things. If all you want is the ability to lookup a word, then a hash table will be faster. If you want to find common prefixes, ordered retrieval, or do similar things, then you want a trie.
A hash table can look up individual strings very quickly. It's like a thoroughbred racehorse. That's all it can do. A trie, on the other hand, is a workhorse that can do a lot of things. It'll never be as fast at lookups as a hash table, but it can do lots of things that the hash table can't do.
For example, finding all the words that start with "pre" will take O(n) time with a dictionary because you have to search all of the words. With a trie, it takes three probes to find the subtree that contains all of those words, and then all you have to do is traverse that subtree. Sure, the worst case is O(n), but that's only if all the words in your trie start with "pre".
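As a hedged sketch of that prefix lookup, here is a minimal Python trie (the class and method names such as Trie and words_with_prefix are made up for illustration, not any particular library's API). Lookup walks one node per character, and a prefix query finds the matching subtree in O(length of prefix) and then traverses only that subtree.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # maps a character to the next TrieNode
        self.is_word = False    # True if a word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:                      # O(m): one step per character
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:                      # O(m) lookup
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def words_with_prefix(self, prefix):
        node = self.root
        for ch in prefix:                    # O(len(prefix)) probes to find the subtree
            node = node.children.get(ch)
            if node is None:
                return []
        results = []

        def collect(n, path):                # traverse only the matching subtree
            if n.is_word:
                results.append(path)
            for ch, child in n.children.items():
                collect(child, path + ch)

        collect(node, prefix)
        return results

t = Trie()
for w in ["prefix", "preset", "press", "apple"]:
    t.insert(w)
print(t.contains("press"))             # True
print(t.words_with_prefix("pre"))      # ['prefix', 'preset', 'press']
```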
While it's true that going to disk will be slower than if the entire trie were in memory, it's wrong to say that a disk-based trie offers no advantage over alternatives. If the data won't fit in memory, then no matter what data structure you use, you'll need some external (i.e. non-memory) storage. The fact that your data access is slower when it's on disk does not fundamentally change the relative advantages and disadvantages of a trie vs. a hash table. For example, a disk-based trie will still be faster than a disk-based hash table when it comes to finding all the words with a particular prefix.
A hash table's overhead is typically a constant multiple of the number of words it contains. That is, in addition to the memory required to store the strings, there is per-string overhead to store the mapping between hash code and string.
Memory for a trie is a little more involved. In the worst case, there is one node per character. All those little node allocations start adding up. Imagine a dictionary that contains 200,000 words, and the average word length is five characters. That's a million nodes of overhead.
Fortunately, there are ways to greatly compress a trie, without losing much, if any, performance. The resulting data structure is much smaller and more cache-friendly than a naively constructed trie.
It's been a while since this was asked, but I'd like to add, if anyone is wondering, that a good hashing function should take O(1) time for fixed memory values such as primitive types or fixed-length lists of primitive types. The same logical operations are often applied on all values to be hashed (logical shift left and right, bitwise operations, etc.). These operations take the same time regardless of what value they're used on. This makes hash tables far quicker, and relatively reliable, at storing values that use up a predictable amount of space. Hashing a string can also be done in O(1) time if you traverse the underlying character array and only pick out characters at intervals to ensure that you're always hashing the same amount of memory.
For example, for a string of length 10, you may hash the 10 characters in the underlying character array, whereas for a string of length 100, you hash based on every tenth character.
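Here is a rough Python sketch of that sampling idea (sampled_hash is a made-up name, and real string hash functions generally do hash every character; this only illustrates the constant-work trick described above).

```python
def sampled_hash(s, sample_size=10):
    """Hash at most `sample_size` characters, spread evenly across the string,
    so the work stays roughly constant no matter how long the string is.
    Illustrative only: production hash functions typically use every character."""
    if not s:
        return 0
    step = max(1, len(s) // sample_size)
    h = len(s)                        # mix in the length so strings of
    for i in range(0, len(s), step):  # different lengths hash differently
        h = (h * 31 + ord(s[i])) & 0xFFFFFFFF
    return h

print(sampled_hash("short"))
print(sampled_hash("a" * 1_000_000))   # still hashes only ~10 characters
```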
So, to answer your question, hashing is usually completed in constant time, whereas insertion into or retrieval from a trie is O(n), where n is the length of the value to be inserted or retrieved. Even if there is little difference in practice, constant time has the advantage of being predictable: all operations on a hash table will take roughly the same time, every time. But with a trie (say, one representing a dictionary of Welsh place names), searching for Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch with one character at the end changed will take far more time than searching for "a"; the system will eat through the whole string before realising that it is not in the dictionary. Google and other tech companies tend to prefer nice, predictable (but evenly distributed) hashing to avoid security concerns.

Hashes: Tables, Lists and Maps, Oh My?

I've been trying to find some concrete (layman's, non-super-academic) definitions for the various types of hash data structures, specifically hash tables, hash lists and hash maps. Online searches provide many useful links to all of these, but never give clear definitions of when it is appropriate to use each over the others.
(1) From a practical standpoint, what's the difference between these 3?
(2) How do their operations' run times differ? Are there clear instances when one should be used or avoided over the other types of hashes?
(3) How do each of these relate back to the Map ADT? Are they all just different implementations of it, or different beasts altogether?
Thanks for any insight here!
There's an abstract data type that represents a mapping between keys and values. It goes by several different names, including Map, Dictionary, Table, Association Table, and more.
The most basic operations that should be supported by this data structure are adding, removing and retrieving a value, given its associated key. There are variations and additions around this basic concept - for instance, some structures support iterating over all the key-value pairs, some structures support multiple values per key, etc. There's also a difference in time and space complexity between the various implementations.
Of the multiple implementations available for this data type, some of the most popular ones use hash functions for fast access times. Those implementations are usually called Hash Table or Hash Map; you can read more about them on Wikipedia. Performance also varies between hash table implementations, with some reaching amortized O(1) insertion and access complexity (at the price of extra space).
A hash list, on the other hand, is a different thing, and is more about how a data structure is used than about its actual structure. A hash list is usually just a regular list of hash values, nothing special about it. It's used when verifying the integrity of a large piece of data: it allows individual data chunks to be verified independently, so that only the bad chunks need to be fixed or re-fetched. This is as opposed to using a single hash value for the entire piece of data, in which case a failure means all the data has to be fixed or retrieved again.
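For instance, here is a minimal Python sketch of a hash list used for chunk-by-chunk verification (hash_list and bad_chunks are made-up names, and the chunk size is arbitrary).

```python
import hashlib

CHUNK_SIZE = 1024  # illustrative chunk size

def hash_list(data, chunk_size=CHUNK_SIZE):
    """Return a plain list of per-chunk hashes -- that's all a hash list is."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def bad_chunks(received, expected_hashes, chunk_size=CHUNK_SIZE):
    """Indices of chunks that fail verification and need to be re-fetched."""
    actual = hash_list(received, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected_hashes)) if a != e]

original = b"x" * 5000
expected = hash_list(original)

corrupted = bytearray(original)
corrupted[2000] ^= 0xFF                           # flip one byte in the second chunk
print(bad_chunks(bytes(corrupted), expected))     # [1] -- only that chunk must be re-fetched
```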

Why is a 2D array better than objects for storing x-y coordinates, in terms of performance and memory?

Assume I want to store n points with integer (x, y) coordinates. I can use a 2-D (2×n) array, or a list / collection / array of n objects where each object has two integer fields to store the coordinates.
As far as I know, the 2D array option is faster and consumes less memory, but I don't know why. Detailed explanations or links with details are appreciated.
This is a very broad question, and it has many parts to it. First off, the answer is relative to the language you are working in. Let's take Java as an example.
In Java, every user-defined class inherits from Object, and that inheritance carries overhead. The runtime has to virtualize certain method calls so that when you call .equals() or .toString(), the program knows which implementation to invoke (your class's .equals() or Object's .equals()). This is accomplished with a lookup table and resolved at runtime through pointers.
This is called virtualization. Now, in Java, an array is actually an object too, so you really don't gain much from an array of arrays. In fact, you might do better using your own class, since you can limit the metadata associated with it. Arrays in Java also store information about their length.
However, many of the collections DO have overhead associated with them. ArrayList for example will resize itself and stores metadata about itself in memory, that you might not need. LinkedList has references to other nodes, which is overhead to its actual data.
Now, what I said is only true about Java. In other OO languages, objects behave differently on the insides, and some may be more/less efficient.
In a language such as C++, when you allocate an array, you are really just getting a chunk of memory, and it is up to you what you want to do with it. In that sense, it might be better. C++ has similar overhead with its objects if you use overriding (the virtual keyword), as it creates the same kind of virtual lookups in memory.
It all comes down to how efficiently you'll be using the storage space and what your access requirements are. Having to set aside memory for a 10,000 x 10,000 array to store only 10 points would be a hideous waste of memory. On the flip side, saving memory by storing the points in a linked list is also pointless if you spend so much time iterating the list to find the one point you actually need among the 10,000,000 stored.
Some of the downsides of both can be overcome: sparse arrays, pre-sorting the list by some rule so "needed" points float to the top, etc.
In most languages, with a multidimensional array, say A×B, you just have a chunk of memory big enough to hold A*B elements, and when you look up element (m, n) all you need to do is find the element at location m*B + n. When you have a list of objects, there is overhead associated with every object, plus the lookup is more complex than a simple address calculation.
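A small Python sketch of that flat-storage idea (PointStore is a made-up name; array.array stands in for the raw chunk of memory a C array would give you).

```python
from array import array

class PointStore:
    """n points stored as one flat block of 2*n integers, instead of n objects."""
    def __init__(self, n):
        self.data = array('i', [0] * (2 * n))   # one contiguous chunk of memory

    def set_point(self, i, x, y):
        self.data[2 * i] = x        # element (i, 0) lives at offset i*2 + 0
        self.data[2 * i + 1] = y    # element (i, 1) lives at offset i*2 + 1

    def get_point(self, i):
        return self.data[2 * i], self.data[2 * i + 1]

points = PointStore(3)
points.set_point(1, 10, 20)
print(points.get_point(1))   # (10, 20) -- just arithmetic on a flat offset
```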
If the size of your matrix is constant, a 2D array is the fastest option. If it needs to grow and shrink though you probably have no option but to use the second approach.

What are some uses for linked lists?

Do linked lists have any practical uses at all? Many computer science books compare them to arrays and say the main advantage is that they are mutable. However, most languages provide mutable versions of arrays. So do linked lists have any actual uses in the real world, or are they just part of computer science theory?
They're absolutely precious (in both the popular doubly-linked version and the less-popular, but simpler and faster when applicable!, single-linked version). For example, inserting (or removing) a new item in a specified "random" spot in a "mutable version of an array" (e.g. a std::vector in C++) is O(N) where N is the number of items in the array, because all that follow (on average half of them) must be shifted over, and that's an O(N) operation; in a list, it's O(1), i.e., constant-time, if you already have e.g. the pointer to the "previous" item. Big-O differences like this are absolutely huge -- the difference between a real-world usable and scalable program, and a toy, "homework"-level one!-)
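A minimal Python sketch of that constant-time insertion (Node and insert_after are made-up names): given a reference to the previous node, insertion rewires two links and never shifts anything.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def insert_after(prev_node, value):
    """O(1): rewire two links; no other node is touched or shifted."""
    prev_node.next = Node(value, prev_node.next)

# Build a small singly-linked list: 1 -> 2 -> 4
head = Node(1, Node(2, Node(4)))

# Insert 3 after the node holding 2 -- constant time, unlike shifting an array
insert_after(head.next, 3)

node, out = head, []
while node:
    out.append(node.value)
    node = node.next
print(out)   # [1, 2, 3, 4]
```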
Linked lists have many uses. For example, implementing data structures that appear to the end user to be mutable arrays.
If you are using a programming language that provides implementations of various collections, many of those collections will be implemented using linked lists. When programming in those languages, you won't often be implementing a linked list yourself, but it might be wise to understand them so you can understand what tradeoffs the libraries you use are making. In other words, the set "just part of computer science theory" contains elements that you need to know if you are going to write programs that actually work.
The main applications of linked lists are:
- Representing polynomials, i.e. addition/subtraction/multiplication of two polynomials. E.g. p1 = 2x^2 + 3x + 7 and p2 = 3x^3 + 5x + 2, so p1 + p2 = 3x^3 + 2x^2 + 8x + 9 (see the sketch after this list).
- Dynamic memory management: allocating and releasing memory at runtime.
- Symbol tables.
- Balancing parentheses.
- Representing sparse matrices.
Ref: http://www.cs.ucf.edu/courses/cop3502h.02/linklist3.pdf
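As a rough sketch of the polynomial example above (Python, with made-up names Term, add and show): each node holds a coefficient and an exponent, and addition merges two lists that are sorted by descending exponent.

```python
class Term:
    def __init__(self, coeff, exp, next=None):
        self.coeff, self.exp, self.next = coeff, exp, next

def add(p, q):
    """Add two polynomials stored as linked lists of terms, both sorted by
    descending exponent; returns a new list in the same order."""
    dummy = tail = Term(0, 0)
    while p and q:
        if p.exp == q.exp:
            c = p.coeff + q.coeff
            if c:
                tail.next = tail = Term(c, p.exp)
            p, q = p.next, q.next
        elif p.exp > q.exp:
            tail.next = tail = Term(p.coeff, p.exp)
            p = p.next
        else:
            tail.next = tail = Term(q.coeff, q.exp)
            q = q.next
    while p:
        tail.next = tail = Term(p.coeff, p.exp); p = p.next
    while q:
        tail.next = tail = Term(q.coeff, q.exp); q = q.next
    return dummy.next

def show(poly):
    terms = []
    while poly:
        terms.append(f"{poly.coeff}x^{poly.exp}")
        poly = poly.next
    return " + ".join(terms)

# p1 = 2x^2 + 3x + 7,  p2 = 3x^3 + 5x + 2
p1 = Term(2, 2, Term(3, 1, Term(7, 0)))
p2 = Term(3, 3, Term(5, 1, Term(2, 0)))
print(show(add(p1, p2)))   # 3x^3 + 2x^2 + 8x^1 + 9x^0
```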
So do linked lists have any actual uses in the real world,
A use/example of a (doubly) linked list can be a lift in a building:
- A person has to pass through every floor to reach the top (the tail, in linked-list terms).
- A person can never jump to some random floor directly (they have to go through the intermediate floors/nodes).
- A person can never go beyond the top floor (the next pointer of the tail node is null).
- A person can never go below the ground floor (the previous pointer of the head node is null).
Yes, of course it's useful, for many reasons.
Any time, for example, that you want efficient insertion into and deletion from a list: finding the place of insertion takes an O(N) search, but performing the insertion once you already have the correct position is O(1).
Also the concepts you learn from working with linked lists help you learn how to make tree based data structures and many other data structures.
A primary advantage of a linked list as opposed to a vector is that random insertion is as simple as decoupling a pair of pointers and recoupling them to the new object (this is, of course, slightly more work for a doubly-linked list). A vector, on the other hand, generally reorganizes memory on insertions, which makes them significantly slower. A singly-linked list without a tail pointer is not as efficient, however, at things like appending to the end of the container, because you have to walk all the way through the list to reach it.
An Immutable Linked List is the most trivial example of a Persistent Data Structure, which is why it is the standard (and sometimes even only) data structure in many functional languages. Lisp, Scheme, ML, Haskell, Scala, you name it.
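A tiny sketch of that persistence property in Python (using plain tuples as cons cells; this mimics how Lisp-family lists behave and isn't any specific library's API): "adding" an element builds a new head that shares the whole old list, so the old version stays valid.

```python
# An immutable cons cell: (head, rest), where rest is another cell or None.
empty = None

def cons(value, rest):
    return (value, rest)

def to_list(cell):
    out = []
    while cell is not None:
        out.append(cell[0])
        cell = cell[1]
    return out

xs = cons(3, cons(2, cons(1, empty)))   # the list 3, 2, 1
ys = cons(4, xs)                        # "adds" 4 without modifying xs

print(to_list(xs))   # [3, 2, 1]    -- the old version is still intact
print(to_list(ys))   # [4, 3, 2, 1] -- the new version shares xs's cells
```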
Linked lists are very useful in dynamic memory allocation, and such lists are used in operating systems. Insertion and deletion in linked lists are very cheap. Complex data structures like trees and graphs can be implemented using linked lists.
Arrays that grow as needed are always just an illusion, because of the way computer memory works. Under the hood, an array is just a contiguous block of memory that has to be reallocated when enough new elements have been added. Likewise, if you remove elements from the array, you'll have to allocate a new block of memory, copy the array and release the previous block to reclaim the unused memory. A linked list lets you grow and shrink a list of elements without having to reallocate the rest of the list.
Linked lists are useful because elements can be efficiently spliced and removed in the middle as others noted. However a downside to linked lists are poor locality of reference. I prefer not using lists for this reason unless I have an explicit need for the capabilities.

What is the standard OCaml data structure with fastest iteration?

I'm looking for a container that provides fastest unordered iterations through the encapsulated elements. In other words, "add once, iterate many times".
Is there one among OCaml's standard modules that is fast enough (such that further optimization of it would be useless)? Or some kind of third-party GPL-ready ones?
AFAIK there's just one OCaml compiler, so the concept of being fast is more or less clear...
...But after I saw a couple of answers, it appears it's not. Of course, there are plenty of data structures that allow O(n) iteration through a container of size n. But the task I'm solving is one of those where the difference between O(n) and O(2n) matters ;-).
I also see that Arrays and Lists provide unnecessary information about the order in which elements were added, which I don't need. Maybe in the "functional world" there exist data structures that can trade this information for a bit of iteration speed.
In C I would outright pick a plain array. The question is, what should I pick in OCaml?
You are unlikely to do better than built-in arrays and lists, since they are hand-coded in C, unless you bind to your own native implementation of an iterator. An array will behave almost exactly like an array in C (a contiguously allocated block of memory containing a sequence of element values), possibly with some extra pointer indirections due to boxing. Lists are implemented exactly how you would expect: as cells with a value and a "next" pointer. Arrays will give you the best locality for unboxed types (especially floats, which have a super-special unboxed implementation).
For information about the implementation of arrays and lists, see Section 18.3 of the OCaml manual and the files byterun/mlvalues.h, byterun/array.c, and byterun/alloc.c in the OCaml source code.
From the questioner: indeed, Array appeared to be the fastest solution. However, it only outperformed List by 7%. Maybe that was because the type of the array elements was not plain enough: it was an algebraic type. Hashtbl performed 4 times worse, as expected.
So, I will pick Array, and I'm accepting this one. Good.
To know for sure, you're going to have to measure. Based on the machine instructions the compiler is likely to generate, I would try an array, then a list.
- Access to an array element requires a bounds check, address arithmetic, and a load.
- Access to the head of a list requires a load, a test for the empty list, and a load at a known compile-time offset.
The details of which is faster probably depend on your application and what else is happening on your machine. They also depend on the type of elements; for example, if they are floating-point numbers, ocamlopt may be clever enough to make an unboxed array, which will save you a level of indirection.
Other common data structures like hash tables or balanced trees generally require that you allocate some context somewhere to keep track of where you are. With an array, keeping track requires only an integer index; with a list, keeping track requires a single pointer. I think this is going to be hard to beat in another data structure.
Finally please note that there may be only one OCaml compiler, but it has two back ends: bytecode and native code. Naturally if you care about this level of performance, you are using the native-code ocamlopt version. Right?
Please take measurements and edit the results into your question.
Don't forget about Bigarrays: they are the closest thing to C arrays (just a flat piece of memory), but they cannot contain arbitrary OCaml values. Also consider switching bounds checking off (unsafe_set/get). And of course you should profile first.
The array - a linear piece of memory with the items visited in sequential order - best utilises the CPU's L1 data cache.
All common data structures are iterable in O(n) time, so the differences between data structures will only be constant (and very probably not significant).
At least lists and arrays allow iteration without significant overhead. I can't think of a situation where that would not be fast enough.

Resources