I'm a bit struggling with the understanding of the Array class in Ruby. I have seen on Google
that an Array class is actually more of a list, but I can't seem to find how it actually works.
I am really concerned with performance issues as I have to deal with large sorted lists, and
I don't want to step over the whole array to add a single element to it.
So I was wondering if there were any real and clear implementation of a list (as in caml for instance), and I am also looking for a good documentation about how Array's method are implemented, regarding optimization matters.
Thanks!
A Ruby array offers a full list interface:
push/<< for adding element at the end
each provides an iterator for list traversal
sort lets you sort the items with an optional block for a custom comparator
...
So there's no apparent need to have a special List class or Module - take Java for example, we end up using ArrayList if we need a List all the time because it gives us good performance and the additional benefit of accessing elements by their index. So Ruby (similar to other languages such as Python, PHP or Lua) tries to keep things simple with regards to collection types by offering just three types - Array, Hash and Set - therefore with a rich interface that makes it easy to emulate other collection types such as a List, Queue or Deque etc.
If you'd like to know details about the implementation, I would recommend simply downloading the Ruby sources and investigating the corresponding files (for MRI it's array.c in the top-level directory) .
Related
I recently started learning Ruby and hashes. At first I learned that hashes are unordered which makes sense, but now I found out that hashes are ordered with later versions of Ruby. I don't really understand why or the concept behind this.
Could I get some insight as to what the ordered hashes are for? Possible use cases would be nice too for regular hash vs. ordered hash.
Some people like to rely on ordering of a Hash, because the ordered-hash remembers the insertion order of the key/value pairs. This allows the programmer to use a hash somewhat like a queue with random access to the values associated with the keys. This would be useful if they intend to change values on the fly and then iterate over the queue's key/value pairs to retrieve them in the insertion order again.
Also, rather than have to supply indexes into the queue, like they would if they were using an Array-based queue, they can supply a symbolic name.
Instead of:
queue[0]
they can use:
queue[:fred]
That's the only use-case I can see for ordered hashes; It'd be really easy to duplicate the functionality with a queue of keys that preserved the insertion order.
Looking back at some of the previous posts by Matz, he was pretty vague as to why it was implemented. Check out https://www.ruby-forum.com/topic/166075
He basically states that it was implemented to fit some edge cases but he didn't seem to elaborate on it more than that. He also stated that there was no impact on performance, just a negligible increase in memory consumption.
Imagine git commits, handled by ruby git wrapper. They are likely instance of a Hash, with sha as keys. While the sorting by Date makes them easily iteratable in a human-friendly manner.
After reading many similar questions:
JavaScript implementation of a set data structure
Mimicking sets in JavaScript?
Node JS, traditional data structures? (such as Set, etc), anything like Java.util for node?
Efficient Javascript Array Lookup
Best way to find if an item is in a JavaScript array?
How do I check if an array includes an object in JavaScript?
I still have a question: suppose I have a large array of strings (several thousands), and I have to make many lookups (i.e. check many times whether a given string is contained in this array). What is the most efficient way to do this in Node.js ?
A. Sort the array of strings, then use binary search? or:
B. Convert the strings to keys of an object, then use the "in" operator
?
I know that the complexity of A is O(log N), where N is the number of strings.
But I don't know the complexity of B.
IF a Javascript object is implemented as a hash table, then the complexity of B is, on average, O(1), which is better than A. However, I don't know if a Javascript object is really implemented as a hash table!
Update for 2016
Since you're asking about node.js and it is 2016, you can now use either the Set or Map object from ES6 as these are built into ES6. Both allows you to use any string as a key. The Set object is appropriate when you just want to see if the key exists as in:
if (mySet.has(someString)) {
//code here
}
And, Map is appropriate when you want to store a value for that key as in:
if (myMap.has(someString)) {
let val = myMap[someString];
// do something with val here
}
Both ES6 features are now built into node.js as of node V4 (the current version of node.js as of this edit is v6).
See this performance comparison to see how much faster the Set operations are than many other choices.
Older Answer
All important performance questions should be tested with actual performance tests in a tool like jsperf.com. In your case, a javascript object uses a hash-table like implementation because without something that performs pretty well, the whole implementation would be slow since so much of javascript uses object.
String keys on an object would be the first thing I'd test and would be my guess for the best performer. Since the internals of an object are implemented in native code, I'd expect this to be faster than your own hashtable or binary search implemented in javascript.
But, as I started my answer with, you should really test your specific circumstance with the number and length of strings you are most concerned about in a tool like jsperf.
For fixed large array of string I suggest to use some form of radix search
Also, take a look at different data structures and algorithms (AVL trees, queues/heaps etc) in this package
I'm pretty sure that using JS object as storage for strings will result in 'hash mode' for that object. Depending on implementation this could be O(log n) to O(1) time. Look at some jsperf benchmarks to compare property lookup vs binary search on sorted array.
In practice, especially if I'm not going to use the code in browser I would offload this functionality to something like redis or memcached.
I am looking to implement my own collection class. The characteristics I want are:
Iterable - order is not important
Insertion - either at end or at iterator location, it does not matter
Random Deletion - this is the tricky one. I want to be able to have a reference to a piece of data which is guaranteed to be within the list, and remove it from the list in O(1) time.
I plan on the container only holding custom classes, so I was thinking a doubly linked list that required the components to implement a simple interface (or abstract class).
Here is where I am getting stuck. I am wondering whether it would be better practice to simply have the items in the list hold a reference to their node, or to build the node right into them. I feel like both would be fairly simple, but I am worried about coupling these nodes into a bunch of classes.
I am wondering if anyone has an idea as to how to minimize the coupling, or possibly know of another data structure that has the characteristics I want.
It'd be hard to beat a hash map.
Take a look at tries.
Apparently they can beat hashtables:
Unlike most other algorithms, tries have the peculiar feature that the time to insert, or to delete or to find is almost identical because the code paths followed for each are almost identical. As a result, for situations where code is inserting, deleting and finding in equal measure tries can handily beat binary search trees or even hash tables, as well as being better for the CPU's instruction and branch caches.
It may or may not fit your usage, but if it does, it's likely one of the best options possible.
In C++, this sounds like the perfect fit for std::unordered_set (that's std::tr1::unordered_set or boost::unordered_set to you if you have an older compiler). It's implemented as a hash set, which has the characteristics you describe.
Here's the interface documentation. Note that the hash containers actually offer two sets of iterators, the usual ones and local ones which only go through one bucket.
Many other languages have "hash sets" as well, certainly Java and C#.
As a learning excercise, I've just had an attempt at implementing my own 'merge sort' algorithm. I did this on an std::list, which apparently already had the functions sort() and merge() built in. However, I'm planning on moving this over to a linked list of my own making, so the implementation is not particuarly important.
The problem lies with the fact that a std::list doesnt have facilities for accessing random nodes, only accessing the front/back and stepping through. I was originally planning on somehow performing a simple binary search through this list, and finding my answer in a few steps.
The fact that there are already built in functions in an std::list for performing these kinds of ordering leads me to believe that there is an equally easy way to access the list in the way I want.
Anyway, thanks for your help in advance!
The way a linked list works is that you step through the items in the list one at a time. By definition there is no way to access a "random" element in the list. The Sort method you refer to actually creates a brand new list by going through each node one at a time and placing items at the correct location.
You'll need to store the data differently if you want to access it randomly. Perhaps an array of the elements you're storing.
Further information on linked lists: http://en.wikipedia.org/wiki/Linked_list
A merge sort doesn't require access to random elements, only to elements from one end of the list.
I've been coding for quite sometime now. And my work pertains to solving real-world business scenarios. However, I have not really come across any practical usage of some of the data structures like the Linked List, Queues and Stacks etc.
Not even at the business framework level. Of course, there is the ubiquitous HashTable, ArrayList and of late the List...but is there any practical usage of some of the other basic data structures?
It would be great if someone gave a real-world solution where a Doubly Linked List "performs" better than the obvious easily usable counterpart.
Of course it’s possible to get by with only a Map (aka HashTable) and a List. A Queue is only a glorified List but if you use a Queue everywhere you really need a queue then your code gets a lot more readable because nobody has to guess what you are using that List for.
And then there are algorithms that work a lot better when the underlying data structure is not a plain List but a DoublyLinkedList due to the way they have to navigate the list. The same is valid for all other data structures: there’s always a use for them. :)
Stacks can be used for pairing (parseing) such as matching open brackets to closing brackets.
Queues can be used for messaging, or activity processing.
Linked list, or double linked lists can be used for circular navigation.
Most of these algorithms are usually at a lower level than your usual "business" application. For example indices on the database is a variation of a multiply linked list. Implementation of function calling mechanism(or a parse tree) is a stack. Queues and FIFOs are used for servicing network request etc.
These are just examples of collection structures that are optimized for speed in various scenarios.
LIFO-Stack and FIFO-Queue are reasonably abstract (behavioral spec-level) data structures, so of course there are plenty of practical uses for them. For example, LIFO-Stack is a great way to help remove recursion (stack up the current state and loop, instead of making a recursive call); FIFO-Queue helps "buffer up" and "peel away" work nuggets in a coroutine arrangement; etc, etc.
Doubly-linked-List is more of an implementation issue than a behavioral spec-level one, mostly... can be a good way to implement a FIFO-Queue, for example. If you need a sequence with fast splicing and removal give a pointer to one sequence iten, you'll find plenty of other real-world uses, too.
I use queues, linked lists etc. in business solutions all the time.
Except they are implemented by Oracle, IBM, JMS etc.
These constructs are generally at a much lower level of abstaction than you would want while implementing a business solution. Where a business problem would benifit from
such low level constructs (e.g. delivery route planning, production line scheduling etc.) there is usually a package available to do it or you.
I don't use them very often, but they do come up. For example, I'm using a queue in a current project to process asynchronous character equipment changes that must happen in the order the user makes them.
A linked list is useful if you have a subset of "selected" items out of a larger set of items, where you must perform one type of operation on a "selected" item and a default operation or no operation at all on a normal item and the set of "selected" items can change at will (possibly due to user input). Because linked list removal can be done nearly instantaneously (vs. the traversal time it would take for an array search), if the subsets are large enough then it's faster to maintain a linked list than to either maintain an array or regenerate the whole subset by scanning through the whole larger set every time you need the subset.
With a hash table or binary tree, you could search for a single "selected" item, but you couldn't search for all "selected" items without checking every item (or having a separate dictionary for every permutation of selected items, which is obviously impractical).
A queue can be useful if you are in a scenario where you have a lot of requests coming in and you want to make sure to handle them fairly, in order.
I use stacks whenever I have a recursive algorithm, which usually means it's operating on some hierarchical data structure, and I want to print an error message if I run out of memory instead of simply letting the software crash if the program stack runs out of space. Instead of calling the function recursively, I store its local variables in an object, run a loop, and maintain a stack of those objects.