ToLookup vs. ReadOnlyCollection - linq

I've been using LINQ-to-objects for quite a while, but I just now noticed the Enumerable.ToLookup extension method and read its documentation. I came across it while looking for the quickest way to get a read-only interface to an IEnumerable<T>. It seems to me that appending .ToLookup( o => o ) onto the enumerable results in a System.Linq.Lookup object that can serve the same purpose as a ReadOnlyCollection<T>.
So why would I ever create a direct instance of ReadOnlyCollection<T> again?

A lookup is not, conceptually, the same as a read-only enumerable. It's more like a dictionary where each key has multiple values, and is used to look up matching values by key. Calling ToLookup enumerates the input fully and builds the lookup.
A ReadOnlyCollection<T> is far less expensive, as it merely wraps an existing IList<T>; it also matches the semantic meaning of a read-only interface to an IEnumerable<T>.
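To make the cost difference concrete, here is a minimal C# sketch (the class and variable names are just for illustration):

using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Linq;

class LookupVsReadOnly
{
    static void Main()
    {
        var source = new List<int> { 1, 2, 3 };

        // ToLookup enumerates the whole list and copies every element into a new structure.
        ILookup<int, int> lookup = source.ToLookup(x => x);

        // ReadOnlyCollection<T> just wraps the existing list; no copy is made.
        ReadOnlyCollection<int> readOnly = new ReadOnlyCollection<int>(source); // or source.AsReadOnly()

        source.Add(4);
        Console.WriteLine(lookup.Count);   // 3 -- a snapshot taken when ToLookup ran
        Console.WriteLine(readOnly.Count); // 4 -- the wrapper reflects the underlying list
    }
}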

Related

Why does delete return the deleted element instead of the new array?

In Ruby, Array#delete(obj) will search for and remove the specified object from the array. However, maybe I'm missing something here, but I find the return value (the obj itself) quite strange and even a little bit useless.
My humble opinion is that, consistent with methods like sort/sort! and map/map!, there should be two methods, e.g. delete/delete!, where
ary.delete(obj) -> new array, with obj removed
ary.delete!(obj) -> ary (after removing obj from ary)
There are several reasons. First, the current delete is non-pure, and its name should warn the programmer about that, just like many other methods in Array (in fact the whole delete_* family has this issue; they are quite dangerous methods!). Second, returning obj is much less chainable than returning the new array. For example, if delete behaved as I described above, I could do multiple deletions in one statement, or do something else after a deletion:
ary = [1,2,2,2,3,3,3,4]
ary.delete(2).delete(3) #=> [1,4], equivalent to "ary - [2,3]"
ary.delete(2).map{|x| x**2} #=> [1,9,9,9,16]
which is elegant and easy to read.
So I guess my question is: is this a deliberate design for some reason, or is it just a legacy of the language?
If you already know that delete is always dangerous, there is no need to add a bang ! to further signal that it is dangerous. That is why it does not have one. Other methods like map may or may not be dangerous; that is why they have versions with and without the bang.
As for why it returns the extracted element: it provides access to information that would be cumbersome to get at otherwise. The original array after modification can easily be referred to through the receiver, but the extracted element is not easily accessible.
Perhaps you are comparing this to methods that add elements, like push or unshift. Those methods add elements irrespective of what the receiver array already contains, so the returned element would always be the same as the argument you passed; you already know it, so returning the added element is not helpful. Therefore the modified array is returned, which is more useful. For delete, whether the element is extracted depends on whether the receiver array has it, and you don't know that in advance, so it is useful to have it as the return value.
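A small illustration of that point:

ary = [1, 2, 3]
ary.delete(2)  #=> 2    (the element was present and has been removed)
ary.delete(9)  #=> nil  (nothing matched; the return value tells you so)
ary            #=> [1, 3]    (the modified array is always reachable via the receiver)
ary.push(5)    #=> [1, 3, 5] (push returns the array; you already know what you added)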
For anyone who might be asking the same question, I think I understand it a little bit more now so I might as well share my approach to this question.
So the short answer is that Ruby is not a language originally designed for functional programming, nor does it make the purity of methods a priority.
On the other hand, for the particular uses described in my question, we do have alternatives. The - method can be used as a pure alternative to delete in most situations; for example, the code in my question can be written like this:
ary = [1,2,2,2,3,3,3,4]
ary.-([2]).-([3]) #=> [1,4], or simply ary.-([2,3])
ary.-([2]).map{|x| x**2} #=> [1,9,9,9,16]
and you can happily get all the benefits from the purity of -. For delete_if, I guess in most situations select with the condition negated (i.e. reject) could serve as a not-so-great pure candidate.
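For instance, reject behaves as the non-mutating counterpart of delete_if:

ary = [1, 2, 2, 3]
ary.reject { |x| x == 2 }     #=> [1, 3]  (pure: ary is left unchanged)
ary.delete_if { |x| x == 2 }  #=> [1, 3]  (mutates ary in place)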
As for why the delete family was designed like this, I think it's more a difference in point of view: they are meant as shorthands for commonly needed non-pure procedures rather than as counterparts to the functional-flavored select, map, etc.
I’ve wondered some of these same things myself. What I’ve largely concluded is that the method simply has a misleading name that carries with it false expectations. Those false expectations are what trigger our curiosity as to why the method works like it does. Bottom line—I think it’s a super useful method that we wouldn’t be questioning if it had a name like “swipe_at” or “steal_at”.
Anyway, another alternative we have is values_at(*args) which is functionally the opposite of delete_at in that you specify what you want to keep and then you get the modified array (as opposed to specifying what you want to remove and then getting the removed item).
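For example:

ary = [10, 20, 30, 40]
ary.delete_at(1)        #=> 20            (removes the item at index 1 and returns it)
ary                     #=> [10, 30, 40]

ary = [10, 20, 30, 40]
ary.values_at(0, 2, 3)  #=> [10, 30, 40]  (keep these indices; ary itself is untouched)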

Purely functional equivalent of weakhashmap?

Weak hash tables like Java's WeakHashMap use weak references so that, once the garbage collector finds a key unreachable, the binding for that key is removed from the table. Weak hash tables are typically used to implement indirections from one vertex or edge in a graph to another, because they allow the garbage collector to collect unreachable portions of the graph.
Is there a purely functional equivalent of this data structure? If not, how might one be created?
This seems like an interesting challenge. The internal implementation cannot be pure because it must collect (i.e. mutate) the data structure in order to remove unreachable parts but I believe it could present a pure interface to the user, who could never observe the impurities because they only affect portions of the data structure that the user can, by definition, no longer reach.
That's an interesting concept. One major complication in a "purely functional" setting would be that object identity is not normally observable in a "purely functional" sense. That is, if I copy an object or create a new identical one, in Java it's expected that the clone is not the original. But in a functional setting, the new one is expected to be semantically identical to the old one, even though the garbage collector will treat them differently.
So, if we allow object identity to be a part of the semantics, it would be sound, otherwise probably not. In the latter case, even if a hack could be found (I thought of one, described below), you're likely to have the language implementation fighting you all over the place because it's going to do all sorts of things to exploit the fact that object identity is not supposed to be observable.
One 'hack' that popped into my mind would be to use unique-by-construction values as keys, so that for the most part value equality will coincide with reference equality. For example, I have a library I use personally in Haskell with the following in its interface:
data Uniq s
getUniq :: IO (Uniq RealWorld)
instance Eq (Uniq s)
instance Ord (Uniq s)
A hash map like you describe would probably mostly-work with these as key, but even here I can think of a way it might break: Suppose a user stores a key in a strict field of some data structure, with the compiler's "unbox-strict-fields" optimization enabled. If 'Uniq' is just a newtype wrapper to a machine integer, there may no longer be any object to which the GC can point and say "that's the key"; so when the user goes and unpacks his key to use it, the map may have forgotten about it already. (Edit: This particular example can obviously be worked around; make Uniq's implementation be something that can't be unboxed like that; the point is just that it's tricky precisely because the compiler is trying to be helpful in a lot of ways we might not expect)
TL;DR: I wouldn't say it can't be done, but I suspect that in many cases "optimizations" will either break or be broken by a weak hash map implementation, unless object identity is given first-class observable status.
Purely functional data-structures can't change from the user perspective. So, if I get a key from a hash-map, wait, and then get the same key again, I have to get the same value. I can hold onto keys, so they can't disappear.
The only way it could work is if the API gives me the next generation and the values aren't collected until all references to the past versions of the container are released. Users of the data-structure are expected to periodically ask for new generations to release weakly held values.
EDIT (based on comment): I understand the behavior you want, but you can't pass this test with a map that releases objects:
FunctionalWeakHashMap map = new FunctionalWeakHashMap();
{ // make scope to make o have no references
Object o = new SomeObject();
map["key"] = o;
} // at this point I lose all references to o, and the reference is weak
// wait as much time as you think it takes for that weak reference to collect,
// force it, etc
Assert.isNotNull(map["key"]); // this must be true or map is not persistent
I am suggesting that this test could pass
FunctionalWeakHashMap map = new FunctionalWeakHashMap();
{ // make scope to make o have no references
Object o = new SomeObject();
map["key"] = o;
} // at this point I lose all references to o, and the reference is weak in the map
// wait as much time as you think it takes for that weak reference to collect,
// force it, etc
map = map.nextGen();
Assert.isNull(map["key"]);

Using Hash Tables as function inputs

I am working in a group that is writing, in Ruby, some APIs for tools that we are using. When writing API methods, many of my teammates use a hash table as the method's only parameter, while I write my methods with each value specified.
For example, a class Apple defined as:
class Apple
  # commonName
  # volume
  # color
end
I would instantiate the class like this:
Apple.new( commonName, volume, color )
My teammates would write it so the call looked like:
Apple.new( {"commonName"=>commonName, "volume"=>volume, "color"=>color )
I don't like using a hash table as the input. To me it seems unnecessarily bulky and doesn't add any clarity to the code. While it doesn't appear to be a big deal in this example, some of our methods have more than 10 parameters, and there will often be hash tables nested inside other hash tables. I also noticed that using hash tables in this way is extremely uncommon in public APIs (net/telnet is the only exception that I can think of right now).
Question: What arguments could I make to my team members against using hash tables as input parameters? The bulkiness of the code isn't a sufficient justification (they are not afraid of writing 200-400 character lines), and excessive memory/processing overhead won't work either, because it won't become an issue with the way our tools will be used.
Actually, if your method takes more than 10 arguments, you should either redesign your class or eat dirt and use hashes. For any method that takes more than 4 arguments, positional arguments can be counter-intuitive at the call site, because you have to remember the order correctly.
I think the best solution would be to simply redesign such methods and use something like the builder or fluent pattern.
First of all, you should chide them for using strings instead of symbols for hash keys.
One issue with using a hash is that you then have to check that all the appropriate keys are present in it. This makes it useful for optional parameters, but for mandatory ones, why not use the built-in functionality of the language? For example, with their method, what happens if I do this:
Apple.new( {"commonName"=>commonName, "volume"=>volume} )
Whereas, with Apple.new(commonName, volume), you know you'll get an ArgumentError.
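A rough sketch of that difference, using symbol keys as suggested above (the fetch-based check and the Pear class are just for illustration):

class Apple
  def initialize(opts = {})
    # With a hash you have to validate the keys yourself;
    # fetch raises KeyError when a mandatory key is missing.
    @common_name = opts.fetch(:common_name)
    @volume      = opts.fetch(:volume)
    @color       = opts.fetch(:color)
  end
end

Apple.new(:common_name => "Fuji", :volume => 80)  # raises KeyError (no :color)

class Pear
  def initialize(common_name, volume, color)
    @common_name, @volume, @color = common_name, volume, color
  end
end

Pear.new("Bosc", 75)  # raises ArgumentError: wrong number of arguments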
Named parameters make for more self-documenting code which is nice. But other than that there's not a lot of difference. The Hash allows for more flexibility, especially if you start doing any method aliasing. Also, the various Hash methods in ActiveSupport make setting defaults and verifying inputs pretty painless. I guess this probably wasn't the answer you were looking for.

Return concrete or abstract datatypes?

I'm in the middle of reading Code Complete, and towards the end of the book, in the chapter about refactoring, the author lists a bunch of things you should do to improve the quality of your code while refactoring.
One of his points was to always return as specific types of data as possible, especially when returning collections, iterators etc. So, as I've understood it, instead of returning, say, Collection<String>, you should return HashSet<String>, if you use that data type inside the method.
This confuses me, because it sounds like he's encouraging people to break the rule of information hiding. Now, I understand this when talking about accessors, that's a clear cut case. But, when calculating and mangling data, and the level of abstraction of the method implies no direct data structure, I find it best to return as abstract a datatype as possible, as long as the data doesn't fall apart (I wouldn't return Object instead of Iterable<String>, for example).
So, my question is: is there a deeper philosophy behind Code Complete's advice of always returning as specific a data type as possible, and allow downcasting, instead of maintaining a need-to-know-basis, that I've just not understood?
I think it is simply wrong in most cases. It has to be:
be as lenient as possible, be as specific as needed
In my opinion, you should always return List rather than LinkedList or ArrayList, because the difference is more an implementation detail than a semantic one. The guys from the Google collections API for Java take this one step further: they return (and expect) iterators where that's enough. But they also recommend returning ImmutableList, -Set, -Map etc. where possible, to show the caller that he doesn't have to make a defensive copy.
Besides that, I think the performance of the different list implementations isn't the bottleneck for most applications.
Most of the time one should return an interface or perhaps an abstract type that represents the value being returned. If you are returning a list of X, use List. This ultimately provides maximum flexibility if the need arises to change the returned list type.
Maybe later you realise that you want to return a linked list or a read-only list, etc. If you commit to a concrete type you're stuck, and it's a pain to change. Using the interface solves this problem.
@Gishu
If your API requires that clients cast straight away most of the time, your design is suspect. Why bother returning X if clients need to cast to Y?
Can't find any evidence to substantiate my claim but the idea/guideline seems to be:
Be as lenient as possible when accepting input. Choose a generalized type over a specialized type. This means clients can use your method with different specialized types. So an IEnumerable or an IList as an input parameter would mean that the method can run off an ArrayList or a ListItemCollection. It maximizes the chance that your method is useful.
Be as strict as possible when returning values. Prefer a specialized type if possible. This means clients do not have to second-guess or jump through hoops to process the return value. Also, specialized types have greater functionality. If you choose to return an IList or an IEnumerable, the number of things the caller can do with your return value drops drastically. For example, if you return an IEnumerable rather than an ArrayList, then just to get the number of elements the client must downcast to use the Count property. But such downcasting defeats the purpose: it works today and won't tomorrow (if you change the type of the returned object). So for all practical purposes the client can't get a count of elements easily, leading him to write mundane boilerplate code (in multiple places or as a helper method).
The summary here is that it depends on the context (there are exceptions to most rules). E.g. if the most probable use of your return value is that clients will search the returned list for some element, it makes sense to return a list implementation (type) that supports some kind of search method. Make it as easy as possible for the client to consume the return value.
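A small C# sketch of that asymmetry (the method and class names are made up for illustration):

using System.Collections.Generic;
using System.Linq;

static class Example
{
    // Lenient input: any sequence will do, so callers can pass an array, a List<int>, a LINQ query...
    // Strict output: the caller gets a List<int> with Count, indexing, Sort, etc. available directly.
    public static List<int> EvenNumbers(IEnumerable<int> source)
    {
        return source.Where(n => n % 2 == 0).ToList();
    }
}

// Usage:
// var evens = Example.EvenNumbers(new[] { 1, 2, 3, 4 });
// int howMany = evens.Count;  // no downcast needed because the return type is specific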
I can see how, in some cases, returning a more specific data type could be useful. For example, knowing that the return value is a LinkedList rather than just a List lets you delete from the list knowing that it will be efficient.
I think, while designing interfaces, you should design a method to return as abstract a data type as possible. Returning a specific type makes it clearer what the method returns.
Also, I would understand it in this way:
return as abstract a data type as possible = return as specific a data type as possible
i.e. when your method is supposed to return some collection data type, return Collection rather than Object.
Tell me if I'm wrong.
A specific return type is much more valuable because it:
reduces possible performance issues from discovering functionality via casting or reflection
increases code readability
does NOT, in fact, expose more than is necessary.
The return type of a function is specifically chosen to cater to ALL of its callers. It is the calling function that should USE the return variable as abstractly as possible, since the calling function knows how the data will be used.
Is it only necessary to traverse the structure? Is it necessary to sort it? Transform it? Clone it? These are questions only the caller can answer, and thus the caller can use an abstracted type. The called function MUST provide for all of these cases.
If, in fact, the most specific use case you have right now is Iterable<string>, then that's fine. But more often than not your callers will eventually need more details, so start with a specific return type; it doesn't cost anything.

What is an elegant way to track the size of a set of objects without a single authoritative collection to reference?

Update: Please read this question in the context of design principles, elegance, expression of intent, and especially the "signals" sent to other programmers by design choices.
I have two "views" of a set of objects. One is a dictionary/map indexing the objects by a string value. The other is a dictionary/map indexing the objects by an ordinal (ordering integer). There is no "master" collection of the objects by themselves that can serve as the authoritative source for the number of objects, but the two dictionaries should always both contain references to all the objects.
When a new item is added to the set a reference is added to both dictionaries, and then some processing needs to be done which is affected by the new total number of objects.
What should I use as the authoritative source to reference for the current size of the set of objects? It seems that all my options are flawed in one dimension or another. I can just consistently reference one of the dictionaries, but that would codify an implication of that dictionary's superiority over the other. I could add a 3rd collection, a simple list of the objects to serve as the authoritative list, but that increases redundancy. Storing a running count seems simplest, but also increases redundancy and is more brittle than referencing a collection's self-tracked count on the fly.
Is there another option that will allow me to avoid choosing the lesser evil, or will I have to accept a compromise on elegance?
I would create a class that has (at least) two collections.
A version of the collection that is sorted by string
A version of the collection that is sorted by ordinal
(Optional) A master collection
The class would handle the nitty gritty management:
The syncing of the contents of the two collections
Standard collection actions (e.g. allowing users to get the size, add or retrieve items)
Letting users get items by string or by ordinal
That way you can use the same collection wherever you need either behavior, but still abstract away the "indexing" behavior you are going for.
The separate class gives you a single interface with which to explain your intent regarding how this class is to be used.
I'd suggest encapsulation: create a class that hides the "management" details (such as the current count) and use it to expose immutable "views" of the two collections.
Clients will ask the "manglement" object for an appropriate reference to one of the collections.
Clients adding a "term" (for lack of a better word) to the collections will do so through the "manglement" object.
This way your assumptions and implementation choices are "hidden" from clients of the service and you can document that the choice of collection for size/count was arbitrary. Future maintainers can change how the count is managed without breaking clients.
BTW, yes, I meant "manglement" - my favorite malapropism for management (in any context!)
If both dictionaries contain references to every object, the count should be the same for both of them, correct? If so, just pick one and be consistent.
I don't think it is a big deal at all. Just reference the sets in the same order each time you need to get access to them.
If you really are concerned about it you could encapsulate the collections with a wrapper that exposes the public interfaces - like
Add(item)
Count()
This way it will always be consistent and atomic - or at least you could implement it that way.
But, I don't think it is a big deal.
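Pulling the wrapper suggestions together, here is a minimal C# sketch (the IndexedSet name and its members are invented for illustration):

using System.Collections.Generic;

// One object owns both dictionaries, keeps them in sync,
// and is the single place to ask for the count.
class IndexedSet<T>
{
    private readonly Dictionary<string, T> byName = new Dictionary<string, T>();
    private readonly Dictionary<int, T> byOrdinal = new Dictionary<int, T>();

    public void Add(string name, int ordinal, T item)
    {
        // Both views are updated together, so they can never disagree.
        byName.Add(name, item);
        byOrdinal.Add(ordinal, item);
    }

    public T GetByName(string name) { return byName[name]; }
    public T GetByOrdinal(int ordinal) { return byOrdinal[ordinal]; }

    // Which dictionary backs Count is an invisible implementation detail.
    public int Count { get { return byName.Count; } }
}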
