What is an elegant way to track the size of a set of objects without a single authoritative collection to reference?

Update: Please read this question in the context of design principles, elegance, expression of intent, and especially the "signals" sent to other programmers by design choices.
I have two "views" of a set of objects. One is a dictionary/map indexing the objects by a string value. The other is a dictionary/map indexing the objects by an ordinal (ordering integer). There is no "master" collection of the objects by themselves that can serve as the authoritative source for the number of objects, but the two dictionaries should always both contain references to all the objects.
When a new item is added to the set a reference is added to both dictionaries, and then some processing needs to be done which is affected by the new total number of objects.
What should I use as the authoritative source to reference for the current size of the set of objects? It seems that all my options are flawed in one dimension or another. I can just consistently reference one of the dictionaries, but that would codify an implication of that dictionary's superiority over the other. I could add a 3rd collection, a simple list of the objects to serve as the authoritative list, but that increases redundancy. Storing a running count seems simplest, but also increases redundancy and is more brittle than referencing a collection's self-tracked count on the fly.
Is there another option that will allow me to avoid choosing the lesser evil, or will I have to accept a compromise on elegance?

I would create a class that has (at least) two collections.
A version of the collection that is sorted by string
A version of the collection that is sorted by ordinal
(Optional) A master collection
The class would handle the nitty gritty management:
The syncing of the contents for the collections
Standard collection actions (e.g. allow users to get the size, add or retrieve items)
Let users get by string or ordinal
That way you can use the same collection wherever you need either behavior, but still abstract away the "indexing" behavior you are going for.
The separate class gives you a single interface with which to explain your intent regarding how this class is to be used.
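To make the intent concrete, here is a minimal sketch of that class in Python; the names (ItemSet, add, by_name, by_ordinal) are just illustrative, not anything from the question:

class ItemSet:
    """Keeps the two views in sync and owns the authoritative count."""

    def __init__(self):
        self._by_name = {}     # string -> item
        self._by_ordinal = {}  # ordinal -> item

    def add(self, name, ordinal, item):
        # Both views are updated together, so their sizes can never drift apart.
        self._by_name[name] = item
        self._by_ordinal[ordinal] = item

    def __len__(self):
        # Either dictionary would do; which one answers is an internal detail.
        return len(self._by_name)

    def by_name(self, name):
        return self._by_name[name]

    def by_ordinal(self, ordinal):
        return self._by_ordinal[ordinal]

Callers then ask len(items) and never know (or care) which dictionary answered.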

I'd suggest encapsulation: create a class that hides the "management" details (such as the current count) and use it to expose immutable "views" of the two collections.
Clients will ask the "manglement" object for an appropriate reference to one of the collections.
Clients adding a "term" (for lack of a better word) to the collections will do so through the "manglement" object.
This way your assumptions and implementation choices are "hidden" from clients of the service and you can document that the choice of collection for size/count was arbitrary. Future maintainers can change how the count is managed without breaking clients.
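In Python, for instance, the "manglement" object could hand out read-only views with types.MappingProxyType; this is only a sketch with made-up names, but it shows how clients can look terms up while additions have to go through the managing object:

from types import MappingProxyType

class TermStore:
    def __init__(self):
        self._by_name = {}
        self._by_ordinal = {}

    def add(self, name, ordinal, term):
        self._by_name[name] = term
        self._by_ordinal[ordinal] = term

    @property
    def count(self):
        # Arbitrary choice of backing dictionary, hidden from clients.
        return len(self._by_name)

    @property
    def by_name(self):
        # Live, read-only view; attempts to mutate it raise TypeError.
        return MappingProxyType(self._by_name)

    @property
    def by_ordinal(self):
        return MappingProxyType(self._by_ordinal)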
BTW, yes, I meant "manglement" - my favorite malapropism for management (in any context!)

If both dictionaries contain references to every object, the count should be the same for both of them, correct? If so, just pick one and be consistent.

I don't think it is a big deal at all. Just reference the sets in the same order each time you need to get access to them.
If you really are concerned about it you could encapsulate the collections with a wrapper that exposes the public interfaces - like
Add(item)
Count()
This way it will always be consistent and atomic - or at least you could implement it that way.
But, I don't think it is a big deal.

Related

Is there such a thing as ‘class bloat’ - i.e. too many classes causing inefficiencies?

For example, suppose I have the following classes:
Item
ItemProperty which would include objects such as Colour and Size. There's a relation-property of the Item class which lists all of the ItemProperty objects applicable to this Item (i.e. for one item you might need to specify the Colour and for another you might want to specify the Size).
ItemPropertyOption would include objects such as Red, Green (for Colour) and Big, Small (for Size).
Then an Item Object would relate to an ItemProperty, whereas an ItemChoice Object would relate to an ItemPropertyOption (and the ItemProperty which the ItemPropertyOption refers to could be inferred).
The reason for this is so I could then make use of queries much more effectively, e.g. give me all item-choices which are Red. It would also allow me to use the Parse Dashboard to quickly add elements to the site, as I could easily specify more ItemProperty and ItemPropertyOption objects rather than having to add them in the codebase.
This is just a small example and there's many more instances where I'd like to use classes so that 'options' for various drop-downs in forms are in the database and can easily be added and edited by me, rather than hard-coded.
1) I’ll probably be doing this in a similar way for 5+ more similar kinds of class-structures
2) there could be hundreds of nested properties that I want to access via ‘inverse querying’
So, I can think of 2 potential causes of inefficiency and wanted to know if they’re founded:
Is having lots of classes inefficient?
Is back-querying against nested classes inefficient?
The other option I can think of (if 'class-bloat' really is a problem) is to put fields on the parent classes that, instead of being spread across further property classes as above, represent the properties directly as a nested JSON property.
The job of designing is to render in object descriptions truths about the world that are relevant to the system's requirements. In the world of the OP's "items", it's a fact that items have color, and it's a relevant fact because users care about an item's color. You'd only call a system inefficient if it consumes computing resources that it doesn't need to consume.
So, for something like a configurator, the fact that we have items, and that those items have properties, and those properties have an enumerable set of possible values sounds like a perfectly rational design.
Is it inefficient or "bloated"? The only place I'd raise doubt is in the explicit assertion that items have properties. Of course they do, but that's natively true of JavaScript objects and Parse entities.
In other words, you might be able to get along with just item and several flavors of propertyOptions: e.g. Item has an attribute called "colorProperty" that is a pointer to an instance of "ColorProperty" (whose instances have a name property like 'red', 'green', etc. and maybe describe other pertinent facts, like a more precise description in RGB form).
There's nothing wrong with lots of classes if they represent relevant truth. Do that first. You might discover empirically that your design is too resource consumptive (I doubt you will in this case), at which point we'd start looking for cheats to be somehow skinnier. But do it the right way first, cheat later only if you must.
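As a rough sketch of that slimmer shape (plain Python classes standing in for Parse objects, with made-up names), an item just points at the option it was configured with, and "all the red items" becomes a simple query over that pointer:

class ColorProperty:
    # One instance per possible value, e.g. 'red' or 'green',
    # optionally carrying a more precise description such as an RGB triple.
    def __init__(self, name, rgb=None):
        self.name = name
        self.rgb = rgb

class Item:
    def __init__(self, title, color_property):
        self.title = title
        self.color_property = color_property  # pointer to a ColorProperty instance

red = ColorProperty("red", rgb=(255, 0, 0))
mug = Item("Coffee mug", red)

items = [mug]
red_items = [i for i in items if i.color_property.name == "red"]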
Is having lots of classes inefficient?
It's certainly inefficient for poor humans who have to remember what all those classes do and how they're related to each other. It takes time to write all those classes in the first place, and every line that you write is a line that has to be maintained.
Beyond that, there's certainly some cost for each class in any OOP language, and creating more classes than you really need will mean that you're paying more than you need to for the work that you're doing, which is pretty much the definition of inefficient.
I’ll probably be doing this in a similar way for 5+ more similar kinds of class-structures
Maybe you could spend some time thinking about the similarity between these cases and come up with a single set of more flexible classes that you can use in all those cases. Writing general code is harder than writing very specific code, but if you do a good job you'll recoup the extra effort many times over through reuse.

Is accessing Generic Objects bad compared to Strict Data-Type classes in AS3?

I'm having a debate with a friend regarding Generic Objects vs. Strict Data-Type instances access.
If I have a fairly large JSON file to convert to objects & arrays of data in Flash, is it best that I then convert those objects to strict AS3 classes dedicated to each object?
Is there a significant loss on performance depending on the quantity of objects?
What's the technical reason behind this? Does Generic Object leave a bigger foot-print in memory than Strict Data-Type instances of a custom class?
It's hard to answer this question on a generic scale since in the end "it all depends". What it depends on is what type of objects you are working with, how you expose those objects to the rest of the program and what type of requirements you have on your runtime environment.
Generally speaking, generic objects are bad since you no longer have "type security".
Generally speaking, converting generic objects to typed objects leaves a bigger memory footprint, since the class definitions have to be available at runtime, and it also costs some extra CPU cycles to convert each untyped object into an instance of another type.
In the end it boils down to this: if the data that you received is exposed to the rest of the system, it's generally a good idea to convert it into some kind of typed object.
Converting it to a typed object and then working on that object improves readability, since you don't have to remember whether the data/key table used "image" or "Image" or "MapImage" as the accessor to retrieve the image info of something.
Also, if you ever change the backend system to provide other/renamed keys, you only have to do the change in one place, instead of scattered all over the system.
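The readability point isn't AS3-specific; as a sketch in Python (the field and key names here are invented), converting the parsed JSON once into a typed object gives every later access a single, well-defined accessor and one place to absorb backend renames:

import json
from dataclasses import dataclass

@dataclass
class MapImage:
    url: str
    width: int
    height: int

    @classmethod
    def from_raw(cls, raw):
        # The only place that knows the backend's key names;
        # if the backend renames a key, change it here and nowhere else.
        return cls(url=raw["image"], width=raw["w"], height=raw["h"])

raw = json.loads('{"image": "tiles/3_4.png", "w": 256, "h": 256}')
img = MapImage.from_raw(raw)
print(img.url, img.width, img.height)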
Hope this answer helps :)

Syntax when dereferencing database-backed tree

I'm using MongoDB, so my clusters of data are in dictionaries. Some of these contain references to other Mongo objects. For example, say I have a Person document which has a separate Employer document. I would like to control element access so I can automatically dereference documents. I also have some data with dates, and since PyMongo can't store timezone info, I'd like to store a string timezone alongside the UTC time and have an accessor to the converted times easily.
Which of these options seems the best to you?
Person = {'employer': ObjectID}
Employer = {'name': str}
Option 1: Augmented operations are methods
Examples
print person.get_employer()['name']
person.get_employer()['name'] = 'Foo'
person.set_employer(new_employer)
Pro: Method syntax makes it clear that getting the employer is not just dictionary access
Con: The syntax differs between referenced objects and plain ones, making it hard to normalize the schema if necessary. Augmenting an element would require changing the callers.
Option 2: Everything is an attribute
Examples
print person.employer.name
person.employer.name = 'Foo'
person.employer = new_employer
Pro: Uniform syntax for augmented and non-augmented
?: Makes it unclear that this is backed by a dictionary, but provides a layer of abstraction?
Con: Requires morphing a dictionary to an object, not pythonic?
Option 3: Everything is a dictionary item
Examples
print person['employer']['name']
person['employer']['name'] = 'Foo'
person['employer'] = new_employer
Pro: Uniform syntax for augmented and non-augmented
?: Makes it unclear that some of these accesses are actually method calls, but provides a layer of abstraction?
Con: Dictionary item syntax is error-prone to type IMHO.
Your first 2 options would require making a "Person" class and an "Employer" class, and using __dict__ to read values and setattr for writing values. This approach will be slower, but will be more flexible (you can add new methods, validation, etc.)
The simplest way would be to use only dictionaries (option 3). It wouldn't require any OOP. Personally, I also find it to be the most readable of the 3.
So, if I were you, I would use option 3. It is nice and simple, and easy to expand on later if you change your mind. If I had to choose between the first two, I would choose the second (I don't like overusing getters and setters).
P.S. I'd keep away from person.get_employer()['name'] = 'Foo', regardless of what you do.
Do not be afraid to write a custom class when that will make the subsequent code easier to write/read/debug/etc.
Option 1 is good when you're calling something that's slow/intensive/whatever -- you'll want to save the results so you can use option 2 for subsequent access.
Option 2 is your best bet -- less typing, easier to read, create your classes once then instantiate and away you go (no need to morph your dictionary).
Option 3 doesn't really buy you anything over option 2 (besides more typing, plus allowing typos to pass instead of erroring out)
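For what it's worth, option 2 doesn't take much code; here is a rough Python sketch (class, field, and collection names are placeholders, and db is assumed to be a PyMongo database handle) where attribute access transparently dereferences a stored ObjectId:

class Document:
    # Wraps a raw MongoDB dict and exposes its fields as attributes.
    # Subclasses list which fields are ObjectIds pointing at another collection.
    references = {}

    def __init__(self, raw, db):
        object.__setattr__(self, "_raw", raw)
        object.__setattr__(self, "_db", db)

    def __getattr__(self, name):
        try:
            value = self._raw[name]
        except KeyError:
            raise AttributeError(name)
        ref = self.references.get(name)
        if ref is not None:
            collection, cls = ref
            # Dereference lazily: fetch the linked document when it is asked for.
            return cls(self._db[collection].find_one({"_id": value}), self._db)
        return value

    def __setattr__(self, name, value):
        self._raw[name] = value

class Employer(Document):
    pass

class Person(Document):
    references = {"employer": ("employers", Employer)}

# person = Person(db.people.find_one({"name": "Ada"}), db)
# print(person.employer.name)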

Purely functional equivalent of weakhashmap?

Weak hash tables like Java's WeakHashMap use weak references so that the garbage collector can detect unreachable keys, and they remove the bindings for those keys from the collection. Weak hash tables are typically used to implement indirections from one vertex or edge in a graph to another, because they allow the garbage collector to collect unreachable portions of the graph.
Is there a purely functional equivalent of this data structure? If not, how might one be created?
This seems like an interesting challenge. The internal implementation cannot be pure because it must collect (i.e. mutate) the data structure in order to remove unreachable parts but I believe it could present a pure interface to the user, who could never observe the impurities because they only affect portions of the data structure that the user can, by definition, no longer reach.
That's an interesting concept. One major complication in a "purely functional" setting would be that object identity is not normally observable in a "purely functional" sense. I.E., if I copy an object or create a new identical one, in Java it's expected that the clone is not the original. But in a functional setting, it is expected that the new one be semantically identical to the old one, even though the garbage collector will treat it differently.
So, if we allow object identity to be a part of the semantics, it would be sound, otherwise probably not. In the latter case, even if a hack could be found (I thought of one, described below), you're likely to have the language implementation fighting you all over the place because it's going to do all sorts of things to exploit the fact that object identity is not supposed to be observable.
One 'hack' that popped into my mind would be to use unique-by-construction values as keys, so that for the most part value equality will coincide with reference equality. For example, I have a library I use personally in Haskell with the following in its interface:
data Uniq s
getUniq :: IO (Uniq RealWorld)
instance Eq (Uniq s)
instance Ord (Uniq s)
A hash map like you describe would probably mostly-work with these as key, but even here I can think of a way it might break: Suppose a user stores a key in a strict field of some data structure, with the compiler's "unbox-strict-fields" optimization enabled. If 'Uniq' is just a newtype wrapper to a machine integer, there may no longer be any object to which the GC can point and say "that's the key"; so when the user goes and unpacks his key to use it, the map may have forgotten about it already. (Edit: This particular example can obviously be worked around; make Uniq's implementation be something that can't be unboxed like that; the point is just that it's tricky precisely because the compiler is trying to be helpful in a lot of ways we might not expect)
TL;DR: I wouldn't say it can't be done, but I suspect that in many cases "optimizations" will either break or be broken by a weak hash map implementation, unless object identity is given first-class observable status.
Purely functional data-structures can't change from the user perspective. So, if I get a key from a hash-map, wait, and then get the same key again, I have to get the same value. I can hold onto keys, so they can't disappear.
The only way it could work is if the API gives me the next generation and the values aren't collected until all references to the past versions of the container are released. Users of the data-structure are expected to periodically ask for new generations to release weakly held values.
EDIT (based on comment): I understand the behavior you want, but you can't pass this test with a map that releases objects:
FunctionalWeakHashMap map = new FunctionalWeakHashMap();
{ // make a scope so that o has no references afterwards
    Object o = new SomeObject();
    map.put("key", o);
} // at this point I lose all references to o, and the reference in the map is weak
// wait as much time as you think it takes for that weak reference to be collected,
// force it, etc.
Assert.isNotNull(map.get("key")); // this must be true or the map is not persistent
I am suggesting that this test could pass:
FunctionalWeakHashMap map = new FunctionalWeakHashMap();
{ // make a scope so that o has no references afterwards
    Object o = new SomeObject();
    map.put("key", o);
} // at this point I lose all references to o, and the reference in the map is weak
// wait as much time as you think it takes for that weak reference to be collected,
// force it, etc.
map = map.nextGen();
Assert.isNull(map.get("key"));

Return concrete or abstract datatypes?

I'm in the middle of reading Code Complete, and towards the end of the book, in the chapter about refactoring, the author lists a bunch of things you should do to improve the quality of your code while refactoring.
One of his points was to always return as specific types of data as possible, especially when returning collections, iterators etc. So, as I've understood it, instead of returning, say, Collection<String>, you should return HashSet<String>, if you use that data type inside the method.
This confuses me, because it sounds like he's encouraging people to break the rule of information hiding. Now, I understand this when talking about accessors, that's a clear cut case. But, when calculating and mangling data, and the level of abstraction of the method implies no direct data structure, I find it best to return as abstract a datatype as possible, as long as the data doesn't fall apart (I wouldn't return Object instead of Iterable<String>, for example).
So, my question is: is there a deeper philosophy behind Code Complete's advice of always returning as specific a data type as possible, and allow downcasting, instead of maintaining a need-to-know-basis, that I've just not understood?
I think it is simply wrong for most cases. It has to be:
be as lenient as possible, be as specific as needed
In my opinion, you should always return List rather than LinkedList or ArrayList, because the difference is more an implementation detail and not a semantic one. The guys from the Google collections API for Java take this one step further: they return (and expect) iterators where that's enough. But they also recommend returning ImmutableList, -Set, -Map etc. where possible, to show the caller he doesn't have to make a defensive copy.
Beside that, I think the performance of the different list implementations isn't the bottleneck for most applications.
Most of the time one should return an interface or perhaps an abstract type that represents the value being returned. If you are returning a list of X, then use List. This ultimately provides maximum flexibility if the need arises to change the returned list type.
Maybe later you realise that you want to return a linked list or a read-only list etc. If you use a concrete type you're stuck, and it's a pain to change. Using the interface solves this problem.
@Gishu
If your API requires that clients cast straight away most of the time, your design is flawed. Why bother returning X if clients need to cast to Y?
Can't find any evidence to substantiate my claim but the idea/guideline seems to be:
Be as lenient as possible when accepting input. Choose a generalized type over a specialized type. This means clients can use your method with different specialized types. So an IEnumerable or an IList as an input parameter would mean that the method can run off an ArrayList or a ListItemCollection. It maximizes the chance that your method is useful.
Be as strict as possible when returning values. Prefer a specialized type if possible. This means clients do not have to second-guess or jump through hoops to process the return value. Also, specialized types have greater functionality. If you choose to return an IList or an IEnumerable, the number of things the caller can do with your return value drastically reduces - e.g. if you return an IEnumerable rather than an ArrayList, the client must downcast just to use the Count property to get the number of elements. But such downcasting defeats the purpose - it works today and won't tomorrow (if you change the type of the returned object). So for all practical purposes the client can't get a count of elements easily, leading him to write mundane boilerplate code (in multiple places or as a helper method).
The summary here is it depends on the context (exceptions to most rules). E.g. if the most probable use of your return value is that clients would use the returned list to search for some element, it makes sense to return a List Implementation (type) that supports some kind of search method. Make it as easy as possible for the client to consume the return value.
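The same "lenient in, strict out" guideline, sketched in Python's type hints since the examples above are Java/C# (the function and names are just for illustration): accept any iterable, but promise a concrete list so the caller gets len() and indexing without second-guessing.

from typing import Iterable, List

def sorted_names(people: Iterable[str]) -> List[str]:
    # Lenient input: any iterable (list, tuple, set, generator) is accepted.
    # Strict output: callers always get a real list, so len(), slicing,
    # and random access are guaranteed to work.
    return sorted(set(people))

print(sorted_names(("bob", "alice", "bob")))             # works with a tuple
print(len(sorted_names(n.upper() for n in ["a", "b"])))  # and with a generator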
I could see how, in some cases, having a more specific data type returned could be useful. For example, knowing that the return value is a LinkedList rather than just a List would allow you to delete from the list knowing that it will be efficient.
I think, while designing interfaces, you should design a method to return as abstract a data type as possible. Returning a more specific type would make it clearer what the method returns.
Also, I would understand it in this way:
Return as abstract a data type as possible = return as specific a data type as possible
i.e. when your method is supposed to return any collection data type, return Collection rather than Object.
Tell me if I'm wrong.
A specific return type is much more valuable because it:
reduces possible performance issues from discovering functionality via casting or reflection
increases code readability
does NOT, in fact, expose more than is necessary.
The return type of a function is specifically chosen to cater to ALL of its callers. It is the calling function that should USE the return variable as abstractly as possible, since the calling function knows how the data will be used.
Is it only necessary to traverse the structure? Is it necessary to sort the structure? Transform it? Clone it? These are questions only the caller can answer, and thus it can use an abstracted type. The called function MUST provide for all of these cases.
If, in fact, the most specific use case you have right now is Iterable<string>, then that's fine. But more often than not your callers will eventually need more details, so start with a specific return type - it doesn't cost anything.
