Does using enums instead of booleans really affect cache usage? - performance

I saw a comment thread where it was suggested that enums should be used instead of booleans in general since it's clearer what the parameters do at the call site, and it's easier to refactor if you need to add a case.
Then someone else claimed that that was a terrible idea since it would often create an unnecessary variable for each call, which would be an unnecessary use of compute resources and cache space.
Does this second claim hold up? My understanding is that booleans are usually not stored as single bits, but as sets of bits with more than enough room for the amount of extra options in a typical enum. So the same amount of data would need to be moved around.
If I understand correctly, there would be an extra variable required if the enum has 3 or more options (one to store the enum and one for a derived boolean on each check), but in that case you actually needed the three options, so what can you do? If you have exactly two enum options, couldn't a compiler just transform it into a boolean in the same register (assuming the enum values were specified as something other than one zero and one non-zero value), therefore not using any extra space?
One extra compare instruction I suppose, but cache usage seems to be the much bigger deal performance-wise these days. And if you have an enum that's isomorphic to a boolean, you'll often make the values automatic anyway, so the compiler should be free to fully optimize it.
Do I understand this correctly, or am I missing something?
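To make the size argument concrete, here is a minimal C++ sketch. C++ is an assumption (the question names no language), and Visibility, Blend and drawCost are made-up names for illustration. An enum with a fixed one-byte underlying type occupies one byte, which is also what bool occupies on mainstream compilers, while a scoped enum without a specified type defaults to int. That difference only matters when many values are stored in memory; passed by value, either one fits in a single register.

```cpp
#include <cstdint>

// Hypothetical names, for illustration only.
enum class Visibility : std::uint8_t { Hidden, Visible };  // fixed one-byte underlying type
enum class Blend { Opaque, Transparent };                   // scoped enum, defaults to int

static_assert(sizeof(Visibility) == sizeof(std::uint8_t), "one byte, like a typical bool");
static_assert(sizeof(Blend) == sizeof(int), "enum class without a fixed type uses int");

// Passed by value, each parameter fits in one register either way.
int drawCost(Visibility v, Blend b) {
    int cost = 0;
    if (v == Visibility::Visible) cost += 1;   // a compare against a constant
    if (b == Blend::Transparent) cost += 2;
    return cost;
}
```

A call such as drawCost(Visibility::Visible, Blend::Opaque) reads more clearly at the call site than drawCost(true, false), which was the original argument for enums.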


Is there a way to get the Ruby runtime to combine frozen identical objects into a single instance?

I have data in memory, especially strings, that have large numbers of duplicates. We're hitting the ceiling with memory sometimes and are trying to reduce our footprint. I thought that if I froze the strings, then the Ruby runtime would combine them into single objects in memory. So I thought that this code would return a lower number, ideally, 1, but it did not:
a = Array.new(1000) { 'foo'.dup.freeze } # create separate objects, but freeze them
sleep 5 # give the runtime some time to combine the objects
a.map(&:object_id).uniq.size # => 1000
I guess this makes sense, because if there was a reference to the duplicated object (e.g. object id #202), and all of the frozen strings are combined to use #200, then dereferencing #202 will fail. So maybe my question doesn't make sense.
I guess the best strategy for me to save memory might be to convert the strings to symbols. I am aware that they will never be garbage collected, but there would be a small enough number of them that this would not be a problem. Is there a better way?
You basically have the right idea, but in my opinion you found a big gotcha in Ruby. You are correct that Ruby can dedup frozen strings to save memory but in general frozen ≠ deduped!!!
tl;dr the reason is that the two operations have different semantics. Always use String#-@ if you want it deduped.
Recall that freeze is a method of Object, so it has to work with every class. In English, freeze is "make it so no further changes can be made to this object and also return the same object so that I can keep calling methods on it". In particular, it would be odd if x.freeze != x. Imagine if I had two arrays that I was modifying, then decided to freeze them. Would it make sense for the interpreter to then iterate through both arrays to see if their contents are equal and to decide to completely throw away one of them? That could be very expensive. So in general freeze does not promise this behavior and always returns the same object, just frozen.
Deduping works very differently because when you call -myStr you're actually saying "return the unique frozen version of this string in memory". In most cases the whole point is to get a different object than the one in myStr (so that the GC can clean up that string and only keep the frozen one).
Unfortunately, the distinction is muddled since if you call freeze on a string literal, Ruby will dedup it automatically! This is sensible because there's no way to get a reference to the original literal object; the fact that the interpreter is allowing x.freeze != x doesn't matter, so we might as well save some memory. But it might also give the impression that freeze does guarantee deduping, when in fact it does not.
This gotcha was discussed when string deduping was first introduced, so it is definitely an intentional design decision by the Ruby developers.

What's the most efficient way of combining switch/if statements

This question doesn't address any programming language in particular but of course I'm happy to hear some examples.
Imagine a big number of files, let's say 5000, that have all kinds of letters and numbers in it. Then, there is a method that receives a user input that acts as an alias in order to display that file. Without having the files sorted in a folder, the method(s) need to return the file name that is associated to the alias the user provided.
So let's say user input "gd322" stands for the file named "k4e23", the method would look like
if(input.equals("gd322")){
return "k4e23";
}
Now, imagine having 4 values in that method:
switch(input){
case gd322: return fw332;
case g344d: return 5g4gh;
case s3red: return 536fg;
case h563d: return h425d;
} //switch on string, no break, no string indicators, ..., pls ignore the syntax, it's just pseudo
Keeping in mind we have 5000 entries, there are probably more than just 2 entries starting with g. Now, if the user input starts with 's', instead of wasting CPU cycles checking all the a's, b's, c's, ..., we could also make another switch for this, which then directs to the 'next' methods like this:
switch(input[0]){ //implying we could access strings like that
case a: return switchA(input);
case b: return switchB(input);
// [...]
case g: return switchG(input);
case s: return switchS(input);
}
So the CPU doesn't have to check on all of them, but rather calls a method like this:
switchG(String input){
switch(input){
case gd322: return fw332;
case g344d: return 5g4gh;
// [...]
}
}
Is there any field of computer science dealing with this? I don't know what to call it and therefore don't know how to search for it, but I think my thoughts make sense on a large scale. Pls move the thread if it doesn't belong here but I really wanna see your thoughts on this.
EDIT: don't quote me on that "5000", I am not in the situation described above and I wanted to talk about this completely theoretical, it could also be 3 entries or 300'000, maybe even less or more
If you have 5000 options, you're probably better off hashing them than having hard-coded if / switch statements. In C++ you could also use std::map to pair a function pointer or other option handling information with each possible option.
Interesting, but I don't think you can give a generic answer. It all depends on how the code is executed. Many compilers will have all kinds of optimizations, in the if and switch, but also in the way strings are compared.
That said, if you have actual (disk) files with those lists, then reading the file will probably take much longer than processing it, since disk I/O is very slow compared to memory access and CPU processing.
And if you have a list like that, you may want to build a hash table, or simply a sorted list/array in which you can perform a binary search. Sorting it also takes time, but if you have to do many lookups in the same list, it may be well worth the time.
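The sorted-array variant could look roughly like this in C++17. It is a sketch only: Entry, kSorted and findSorted are hypothetical names, and the four entries are the ones from the question, pre-sorted by alias so that std::lower_bound can binary-search them.

```cpp
#include <algorithm>
#include <array>
#include <optional>
#include <string_view>

// Hypothetical record type: alias -> file name.
struct Entry {
    std::string_view alias;
    std::string_view file;
};

// Must stay sorted by alias for the binary search to be valid.
constexpr std::array<Entry, 4> kSorted{{
    {"g344d", "5g4gh"},
    {"gd322", "fw332"},
    {"h563d", "h425d"},
    {"s3red", "536fg"},
}};

// O(log n) lookup; returns nothing when the alias is unknown.
std::optional<std::string_view> findSorted(std::string_view alias) {
    auto it = std::lower_bound(
        kSorted.begin(), kSorted.end(), alias,
        [](const Entry& e, std::string_view key) { return e.alias < key; });
    if (it == kSorted.end() || it->alias != alias) return std::nullopt;
    return it->file;
}
```

With 5000 entries this needs about 13 string comparisons per lookup, without any hand-written switch.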
Is there any field of computer science dealing with this?
Yes, the science of efficient data structures. Well, isn't that what CS is all about? :-)
The algorithm you described resembles a trie. It wouldn't be statically encoded in the source code with switch statements, but would use dynamic lookups in a structure loaded from somewhere and stuff, but the idea is the same.
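For reference, a very small trie along those lines might be sketched as follows in C++17. TrieNode, insertAlias and lookupAlias are names invented here, and a real implementation would likely use a more compact child representation than std::map. Each level consumes one character, which is exactly the "switch on the first letter, then on the rest" idea from the question.

```cpp
#include <map>
#include <memory>
#include <optional>
#include <string>

// One node per character position; hypothetical layout.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    std::optional<std::string> file;  // set only where a complete alias ends
};

// Walk (and create) one node per character, then store the file name.
void insertAlias(TrieNode& root, const std::string& alias, const std::string& file) {
    TrieNode* node = &root;
    for (char c : alias) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->file = file;
}

// Follow the characters; give up as soon as a branch is missing.
std::optional<std::string> lookupAlias(const TrieNode& root, const std::string& alias) {
    const TrieNode* node = &root;
    for (char c : alias) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return std::nullopt;
        node = it->second.get();
    }
    return node->file;
}
```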
Yes, the problem has been known and solved for decades: hash functions.
Basically you have a set of values (here strings like "gd322", "g344d") and you want to know if some other value v is among them.
The idea is to put the strings in a big array, at an index calculated from their values by some function. Given a value v, you'll compute an index the same way, and check whether the value v is here or not. Much faster than checking the whole array.
Of course there is a problem when different values fall at the same place: collisions. Some magic is needed then, such as perfect hash functions, whose coefficients are tweaked so that values from the initial set don't cause any collisions.
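In practice you rarely build the hash table yourself. As a hedged sketch in C++ (aliasTable and resolveAlias are hypothetical names, and the entries are the ones from the question), std::unordered_map computes the index from the key and resolves collisions internally, so a perfect hash function is not required for correctness:

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Alias -> file name table, built once on first use.
const std::unordered_map<std::string, std::string>& aliasTable() {
    static const std::unordered_map<std::string, std::string> table = {
        {"gd322", "fw332"},
        {"g344d", "5g4gh"},
        {"s3red", "536fg"},
        {"h563d", "h425d"},
    };
    return table;
}

// Average O(1) lookup; returns nothing when the alias is unknown.
std::optional<std::string> resolveAlias(const std::string& alias) {
    const auto& table = aliasTable();
    auto it = table.find(alias);
    if (it == table.end()) return std::nullopt;
    return it->second;
}
```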

Overhead of std::optional<T>?

Now that std::experimental::optional has been accepted (or is about to be accepted), I wonder what the overhead is, and what the consequences are for the generated assembly, when the inner value is accessed with the following operators:
->
*
value
value_or
compared to the case without std::optional. It could be particularly important for computationally intensive programs.
For example, what would be order of magnitude of the overhead on operations on a std::vector<std::experimental::optional<double>> compared to a std::vector<double> ?
-> and * ought to have zero overhead.
value and value_or ought to have the overhead of one branch: if(active)
Also, copy/move constructor, copy/move assignment, swap, emplace, operator==, operator<, and the destructor ought to also have the overhead of one branch.
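The difference between the accessors can be shown with a short sketch. It uses C++17 std::optional rather than std::experimental::optional (an assumption; the interfaces are very similar), and the three wrapper functions are hypothetical names:

```cpp
#include <optional>

// operator*: no check at all; undefined behaviour if the optional is empty.
double uncheckedRead(const std::optional<double>& o) { return *o; }

// value(): one branch; throws std::bad_optional_access if the optional is empty.
double checkedRead(const std::optional<double>& o) { return o.value(); }

// value_or(): one branch; returns the fallback if the optional is empty.
double readOr(const std::optional<double>& o, double fallback) {
    return o.value_or(fallback);
}
```

So the unchecked operators are only appropriate where the caller already knows the optional holds a value.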
However, one branch of overhead is so small it probably can't even be measured. Seriously, write pretty code, and don't worry about the performance here. Odds are making the code pretty will result in it running faster than if you tried to make it fast. Counter-intuitive, but do it anyway.
There are definitely cases where the overhead becomes noticeable, for instance sorting a large number of optionals. In these cases, there are four situations:
(A) All the optionals are known to be empty ahead of time, in which case, why sort?
(B) Some optionals may or may not be active, in which case the overhead is required and there is no better way.
(C) All optionals are known to have values ahead of time and you don't need the sorted-data in place, in which case, use the zero overhead operators to make a copy of the data where the copy is using the raw type instead of optional, and sort that.
(D) All optionals are known to have values ahead of time, but you need the sorted data in-place. In this case, optional is adding unnecessary overhead, and the easiest way to work around it is to do step C, and then use the no-overhead operators to move the data back.
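Cases (C) and (D) might be sketched like this, again with C++17 std::optional as an assumption. sortAllEngaged is a hypothetical helper, and its precondition (every element holds a value) is assumed rather than checked:

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Precondition (assumed, not checked): every element holds a value.
void sortAllEngaged(std::vector<std::optional<double>>& items) {
    // Step C: copy the payloads out with the unchecked operator*.
    std::vector<double> raw;
    raw.reserve(items.size());
    for (const auto& o : items) raw.push_back(*o);

    // Sort the raw values; comparisons no longer branch on "has a value".
    std::sort(raw.begin(), raw.end());

    // Step D: write the sorted values back in place.
    for (std::size_t i = 0; i < raw.size(); ++i) items[i] = raw[i];
}
```

Whether the two extra copies pay for themselves depends on the data size, so it is worth measuring before adopting this.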
Besides the other answer, you should also consider that std::optional requires additional memory.
Often it's not just an extra byte, but (at least for "small" types) a 2x space overhead due to padding.
Maybe RAM isn't a problem but that also means fewer values available in the cache.
A sentinel value, if domain knowledge allows you to use one, could be a better choice (probably in the form of markable to keep type safety).
An interesting read is: Boost optional - Performance considerations
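A quick way to see the space overhead on a given platform is to compare sizes directly. This is a small sketch using C++17 std::optional rather than std::experimental::optional; extraBytes is a hypothetical helper, and the exact numbers are implementation-defined (8 extra bytes for double is typical on mainstream 64-bit targets, but that is an observation, not a guarantee):

```cpp
#include <cstddef>
#include <optional>

// Bytes that std::optional<T> adds on top of T (flag plus padding).
template <typename T>
constexpr std::size_t extraBytes() {
    return sizeof(std::optional<T>) - sizeof(T);
}

// The "has a value" flag needs at least one byte of its own.
static_assert(extraBytes<double>() >= 1, "optional<double> is larger than double");
static_assert(extraBytes<char>() >= 1, "optional<char> is larger than char");
```

Printing extraBytes<double>() and extraBytes<char>() on your own compiler shows how much of the overhead is padding rather than the flag itself.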

How to get variable/function definitions set in Parallel (e.g. with ParallelMap)?

I have a function that I use to look up a value based on an index. The value takes some time to calculate, so I want to do it with ParallelMap, and references another similar such function that returns a list of expressions, also based on an index.
However, when I set it all up in a seemingly reasonable fashion, I see some very bizarre behaviour. First, I see that the function appears to work, albeit very slowly. For large indexes, however, the processor activity in Taskmangler stays entirely at zero for an extended period of time (i.e. 2-4 minutes) where all instances of Mathematica are seemingly inert. Then, without the slightest blip of CPU use, a result appears. Is this another case of Mathematica spukhafte Fernwirkung?
That is, I want to create a variable/function that stores an expression, here a list of integers (ListOfInts), and then on the parallel workers I want to perform some function on that expression (here I apply a set of replacement rules and take the Min). I want the result of that function to also be indexed by the same index under another variable/function (IndexedFunk), whose result is then available back on the main instance of Mathematica:
(*some arbitrary rules that will convert some of the integers to negative values:*)
rulez=Dispatch[Thread[Rule[Range[222],-Range[222]]]];
maxIndex = 333;
Clear[ListOfInts]
Scan[(ListOfInts[#]=RandomInteger[{1,999},55])&,Range[maxIndex ]]
(*just for safety's sake:*)
DistributeDefinitions[rulez, ListOfInts]
Clear[IndexedFunk]
(*I believe I have to have at least one value of IndexedFunk defined before I Share the definition to the workers:*)
IndexedFunk[1]=Min[ListOfInts[1]]/.rulez
(*... and this should let me retrieve the values back on the primary instance of MMA:*)
SetSharedFunction[IndexedFunk]
(*Now, here is the mysterious part: this just sits there on my multiprocessor machine for many minutes until suddenly a result appears. If I up maxIndex to say 99999 (and of course re-execute the above code again) then the effect can more clearly be seen.*)
AbsoluteTiming[Short[ParallelMap[(IndexedFunk[#]=Min[ListOfInts[#]/.rulez])&, Range[maxIndex]]]]
I believe this is some bug, but then I am still trying to figure out Mathematica Parallel, so I can't be too confident in this conclusion. Despite its being depressingly slow, it is nonetheless impressive in its ability to perform calculations without actually requiring a CPU to do so.
I thought perhaps it was due to whatever communications protocol is being used between the master and slave processes; perhaps it is so slow that it just appears that the processors are doing nothing when in fact they are just waiting to send the next bit of some definition or other. In which case I thought ParallelMap[..., Method->"CoarsestGrained"] would be of some use. But no, that doesn't work either.
A question: "Am I doing something obviously wrong, or is this a bug?"
I am afraid you are. The problem is with the shared definition of a variable. Mathematica maintains a single coherent value in all copies of the variable across kernels, and therefore that variable becomes a single point of huge contention. The CPU is idle because the kernels queue up waiting for the variable IndexedFunk, and most of the time is spent in interprocess or inter-machine communication. Go figure.
By the way, there is no function SetSharedDefinition in any Mathematica version I know of. You probably intended to write SetSharedVariable. But remove that evil call anyway! To avoid contention, return results from the parallelized computation as a list of pairs, and then assemble them into downvalues of your variable at the main kernel:
Clear[IndexedFunk]
Scan[(IndexedFunk[#[[1]]] = #[[2]]) &,
ParallelMap[{#, Min[ListOfInts[#] /. rulez]} &, Range[maxIndex]]
]
ParallelMap takes care of distributing definitions automagically, so the call to DistributeDefinitions is superfluous. (As a minor note, it is not correct as written, omitting the maxIndex variable, but the omission is automatically taken care of by ParallelMap in this particular case.)
EDIT, NB!: The automatic distribution applies only to version 8 of Mathematica. Thanks @MikeHoneychurch for the correction.

Mapping Untyped Lisp data into a typed binary format for use in compiled functions

Background: I'm writing a toy Lisp (Scheme) interpreter in Haskell. I'm at the point where I would like to be able to compile code using LLVM. I've spent a couple days dreaming up various ways of feeding untyped Lisp values into compiled functions that expect to know the format of the data coming at them. It occurs to me that I am not the first person to need to solve this problem.
Question: What are some historically successful ways of mapping untyped data into an efficient binary format.
Addendum: In point of fact, I do know which of about a dozen different types the data is, I just don't know which one might be sent to the function at compile time. The function itself needs a way to determine what it got.
Do you mean, "I just don't know which [type] might be sent to the function at runtime"? It's not that the data isn't typed; certainly 1 and '() have different types. Rather, the data is not statically typed, i.e., it's not known at compile time what the type of a given variable will be. This is called dynamic typing.
You're right that you're not the first person to need to solve this problem. The canonical solution is to tag each runtime value with its type. For example, if you have a dozen types, number them like so:
0 = integer
1 = cons pair
2 = vector
etc.
Once you've done this, reserve the first four bits of each word for the tag. Then, every time two objects get passed in to +, first you perform a simple bit mask to verify that both objects' first four bits are 0b0000, i.e., that they are both integers. If they are not, you jump to an error message; otherwise, you proceed with the addition, and make sure that the result is also tagged accordingly.
This technique essentially makes each runtime value a manually-tagged union, which should be familiar to you if you've used C. In fact, it's also just like a Haskell data type, except that in Haskell the taggedness is much more abstract.
I'm guessing that you're familiar with pointers if you're trying to write a Scheme compiler. To avoid limiting your usable memory space, it may be more sensible to use the bottom (least significant) four bits, rather than the top ones. Better yet, because pointers aligned to 8 bytes already have three always-zero bits at the bottom (4-byte, dword-aligned pointers have two), you can simply co-opt those bits for your tag, as long as you mask the tag off and dereference the actual address, rather than the tagged one.
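Here is a rough sketch of low-bit tagging, written in C++ rather than LLVM IR or Haskell purely for readability. All names (Tag, Value, makeInt, addValues and so on) are invented for this example, it uses three tag bits on 8-byte-aligned objects, and std::nullopt stands in for the "jump to an error message" path:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical tag numbering, following the list above.
enum class Tag : std::uintptr_t { Integer = 0, Cons = 1, Vector = 2 };

using Value = std::uintptr_t;                        // one tagged machine word
constexpr Value kTagBits = 3;                        // low bits freed by 8-byte alignment
constexpr Value kTagMask = (Value{1} << kTagBits) - 1;

struct alignas(8) Cons { Value car; Value cdr; };    // alignment keeps the low 3 bits zero

inline Tag tagOf(Value v) { return static_cast<Tag>(v & kTagMask); }

// Integers live in the upper bits; the Integer tag (0) sits in the low bits.
inline Value makeInt(std::intptr_t n) { return static_cast<Value>(n) << kTagBits; }
// Assumes two's-complement conversion and arithmetic right shift
// (guaranteed since C++20, and what mainstream compilers do anyway).
inline std::intptr_t intOf(Value v) { return static_cast<std::intptr_t>(v) >> kTagBits; }

inline Value makeCons(Cons* p) {
    return reinterpret_cast<Value>(p) | static_cast<Value>(Tag::Cons);
}
// Mask the tag off before using the address.
inline Cons* consOf(Value v) { return reinterpret_cast<Cons*>(v & ~kTagMask); }

// The "+" primitive: check both tags first, then add and re-tag the result.
inline std::optional<Value> addValues(Value a, Value b) {
    if (tagOf(a) != Tag::Integer || tagOf(b) != Tag::Integer) return std::nullopt;
    return makeInt(intOf(a) + intOf(b));
}
```

The price is that integers lose three bits of range, which is the usual trade-off with this scheme.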
Does that help?
Your default solution should be a simple tagged union. If you want to narrow your typing down to more specific types, you can do it - but it won't be that "toy" any more. A thing to look at is called abstract interpretation.
There are a few successful implementations of such an optimisation, with V8 being probably the most widespread. In the Scheme world, the most aggressively optimising implementation is Stalin.
