The compilation flag -fmerge-all-constants merges identical constants into a single variable. I keep reading that this results in non-conforming code, and Linus Torvalds wrote that it's inexcusable, but why?
What can possibly happen when you merge two or more identical constant variables?

There are times when programs declare a constant object because they need something with a unique address, and there are times when any address that points to storage holding the proper sequence of byte values would be equally usable. In C, if one writes:
char const * const HelloShared1 = "Hello";
char const * const HelloShared2 = "Hello";
char const HelloUnique1[] = "Hello";
char const HelloUnique2[] = "Hello";
a compiler would have to reserve space for at least three copies of the word Hello, followed by a zero byte. The names HelloUnique1 and HelloUnique2 would refer to two of those copies, and the names HelloShared1 and HelloShared2 would need to identify storage that was distinct from that used by HelloUnique1 and HelloUnique2, but HelloShared1 and HelloShared2 could at the compiler's convenience identify the same storage.
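The distinction can be observed directly. The following is only a minimal sketch in C++ built on the four declarations above; the helper names (unique_arrays_are_distinct, shared_pointers_coincide, report) are made up for illustration. The first check must succeed on a conforming build, while the second may legitimately go either way.

```cpp
#include <cassert>
#include <cstdio>

// The four declarations from the text above.
char const * const HelloShared1 = "Hello";
char const * const HelloShared2 = "Hello";
char const HelloUnique1[] = "Hello";
char const HelloUnique2[] = "Hello";

// Two distinct named arrays: a conforming implementation must give them
// different addresses, so this is true unless the constants were merged.
bool unique_arrays_are_distinct() {
    return &HelloUnique1[0] != &HelloUnique2[0];
}

// Two pointers to identical string literals: the implementation may or may
// not share the storage, so either result is conforming.
bool shared_pointers_coincide() {
    return HelloShared1 == HelloShared2;
}

// Hypothetical helper: prints what this particular build decided.
void report() {
    std::printf("unique arrays distinct:   %d\n", unique_arrays_are_distinct());
    std::printf("shared pointers coincide: %d\n", shared_pointers_coincide());
    // Fires only if the build merged the two named arrays into one object.
    assert(unique_arrays_are_distinct());
}
```

Calling report() from main shows the choice a given compiler and flag set made. A program that relies on the first property, for example by using an object's address as a unique key, is the kind of program that merging named constants can break.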
Unfortunately, while the C and C++ Standards usefully provide two ways of specifying objects that hold string literal data, so as to allow programmers to indicate when multiple copies of the same information may be placed in the same storage, they fail to provide any means of specifying the same semantics for any other kind of constant data. For most kinds of applications, situations where a program would care about whether two objects share the same address would be far less common than those where using the same storage for constant objects holding the same data would be advantageous.
Being able to invite an implementation to make optimizations which would not be allowable by the Standard is useful, if one recognizes that programs should not be expected to be compatible with all optimizations, nor vice versa, and if compiler writers do a good job of documenting what kinds of programs different optimizations are compatible with and letting programmers enable only optimizations that are known to be compatible with their code.
Fundamentally, optimizations that assume programs won't do X will be useful for applications that don't involve doing X, but at best counter-productive for those that do. The described optimizations would fall into this category. I wouldn't see any basis for complaining about a compiler that makes such optimizations available but doesn't enable them by default. On the other hand, some people regard any program that isn't compatible with every imaginable optimization as "broken".


What is a good way to deal with byte alignment and endianness when packing a struct?

My current design involves communication between an embedded system and a PC, where I am always troubled by the struct design.
The two systems have different endianness that I need to deal with. However, I find that I cannot just do a simple byte-order switch for every 4 bytes to solve the problem. It turns out to depend on the struct.
For example, a struct like this:
uint16_t a;
uint32_t b;
would result in padding between a and b. Eventually, the endian switch has to be specific to a and b because of the existence of the padding bytes. But it looks ugly, because I need to change the endian switch logic every time I change the struct content.
What is a good strategy to arrange elements in a struct when padding comes in? Should we try to rearrange the elements so that there are only padding bytes at the end of the struct?
I'm afraid you'll need to do some more platform-neutral serialization, since different architectures have different alignment requirements. I don't think there is a safe and generic way to grab a chunk of memory, send it to another architecture, place it at some address there and read the correct data from it. Just convert and send the elements one by one: you can push the values into a buffer that will not have any padding, and you'll know exactly what is where. Plus, you decide which side will do the conversions (typically the PC has more resources to do that). As a bonus, you can checksum/sign the communication to catch errors/tampering.
BTW, afaik while the compiler keeps the order of the variables intact, it theoretically can put some additional padding between them (e.g. for performance reasons), so it's not just an architecture related thing.
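A minimal sketch of that element-by-element approach, written in C++ for the two fields from the question. The Message type, the helper names and the choice of big-endian as the wire order are assumptions made for illustration only. Because every byte is produced by shifting, the same code gives the same wire bytes regardless of the host's endianness or struct padding.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical message holding the two fields from the question.
struct Message {
    std::uint16_t a;
    std::uint32_t b;
};

// Append the low `bytes` bytes of value, most significant byte first.
void put_be(std::vector<std::uint8_t>& out, std::uint32_t value, int bytes) {
    for (int shift = 8 * (bytes - 1); shift >= 0; shift -= 8)
        out.push_back(static_cast<std::uint8_t>(value >> shift));
}

// Read `bytes` bytes, most significant byte first.
std::uint32_t get_be(const std::uint8_t* in, int bytes) {
    std::uint32_t value = 0;
    for (int i = 0; i < bytes; ++i)
        value = (value << 8) | in[i];
    return value;
}

// Wire layout (assumed): a in bytes 0..1, b in bytes 2..5, no padding.
std::vector<std::uint8_t> serialize(const Message& m) {
    std::vector<std::uint8_t> out;
    put_be(out, m.a, 2);
    put_be(out, m.b, 4);
    return out;
}

Message deserialize(const std::vector<std::uint8_t>& in) {
    assert(in.size() == 6);
    Message m;
    m.a = static_cast<std::uint16_t>(get_be(in.data(), 2));
    m.b = get_be(in.data() + 2, 4);
    return m;
}
```

With this shape, adding or reordering a field means touching one line in serialize and one in deserialize, rather than reworking a byte-swapping pass over the raw struct.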

Performance of std::vector<Test> vs std::vector<Test*>

In an std::vector of a non POD data type, is there a difference between a vector of objects and a vector of (smart) pointers to objects? I mean a difference in the implementation of these data structures by the compiler.
class Test {
public:
    std::string s;
    Test *other;
};
std::vector<Test> vt;
std::vector<Test*> vpt;
Could there be no performance difference between vt and vpt?
In other words: when I define a vector<Test>, internally will the compiler create a vector<Test*> anyway?
No, this is not allowed by the C++ standard. The following code is legal C++:
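vector<Test> vt;
Test t1; t1.s = "1"; t1.other = NULL;
Test t2; t2.s = "2"; t2.other = NULL;
vt.push_back(t1);
vt.push_back(t2);
Test* pt = &vt[0]; // pointer to the first element of the underlying array
pt++; // plain pointer arithmetic reaches the second element
Test q = *pt; // q is now a copy of t2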
In other words, a vector "decays" to an array (accessing it like a C array is legal), so the compiler effectively has to store the elements internally as an array, and may not just store pointers.
But beware that the array pointer is valid only as long as the vector is not reallocated (which normally only happens when the size grows beyond capacity).
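The contiguity guarantee can be checked with a few lines of C++. This is just a sketch; is_contiguous and make_tests are hypothetical helpers, and Test is redeclared as a plain struct so the block stands on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Test {
    std::string s;
    Test* other;
};

// True if element i lives exactly i objects after element 0, i.e. the
// vector stores the Test objects themselves in one contiguous array.
bool is_contiguous(const std::vector<Test>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        if (&v[i] != v.data() + i) return false;
    return true;
}

// Hypothetical helper that builds a small vector to inspect.
std::vector<Test> make_tests() {
    std::vector<Test> v;
    v.push_back(Test{"1", nullptr});
    v.push_back(Test{"2", nullptr});
    v.push_back(Test{"3", nullptr});
    return v;
}
```

A std::vector<Test*> is contiguous in the same sense, but what sits side by side in it are the pointers; the Test objects they refer to can be anywhere, which costs one extra indirection per access.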
In general, whatever the type being stored in the vector is, instances of that may be copied. This means that if you are storing a std::string, instances of std::string will be copied.
For example, when you push a Type into a vector, the Type instance is copied into an instance housed inside the vector. The copying of a pointer will be cheap, but, as Konrad Rudolph pointed out in the comments, this should not be the only thing you consider.
For simple objects like your Test, copying is going to be so fast that it will not matter.
Additionally, with C++11, moving allows avoiding creating an extra copy if one is not necessary.
So in short: A pointer will be copied faster, but copying is not the only thing that matters. I would worry about maintainable, logical code first and performance when it becomes a problem (or the situation calls for it).
As for your question about an internal pointer vector: no, vectors are implemented as arrays that are periodically resized when necessary. You can find GNU's libstdc++ implementation of vector online.
The answer gets a lot more complicated below the C++ level. Pointers will of course have to be involved, since an entire program cannot fit into registers. I don't know enough about that low a level to elaborate further, though.

Mapping Untyped Lisp data into a typed binary format for use in compiled functions

Background: I'm writing a toy Lisp (Scheme) interpreter in Haskell. I'm at the point where I would like to be able to compile code using LLVM. I've spent a couple days dreaming up various ways of feeding untyped Lisp values into compiled functions that expect to know the format of the data coming at them. It occurs to me that I am not the first person to need to solve this problem.
Question: What are some historically successful ways of mapping untyped data into an efficient binary format?
Addendum: In point of fact, I do know which of about a dozen different types the data is, I just don't know which one might be sent to the function at compile time. The function itself needs a way to determine what it got.
Do you mean, "I just don't know which [type] might be sent to the function at runtime"? It's not that the data isn't typed; certainly 1 and '() have different types. Rather, the data is not statically typed, i.e., it's not known at compile time what the type of a given variable will be. This is called dynamic typing.
You're right that you're not the first person to need to solve this problem. The canonical solution is to tag each runtime value with its type. For example, if you have a dozen types, number them like so:
0 = integer
1 = cons pair
2 = vector
Once you've done this, reserve the first four bits of each word for the tag. Then, every time two objects get passed in to +, first you perform a simple bit mask to verify that both objects' first four bits are 0b0000, i.e., that they are both integers. If they are not, you jump to an error message; otherwise, you proceed with the addition, and make sure that the result is also tagged accordingly.
This technique essentially makes each runtime value a manually-tagged union, which should be familiar to you if you've used C. In fact, it's also just like a Haskell data type, except that in Haskell the taggedness is much more abstract.
I'm guessing that you're familiar with pointers if you're trying to write a Scheme compiler. To avoid limiting your usable memory space, it may make more sense to use the bottom (least significant) four bits, rather than the top ones. Better yet, because pointers aligned to 8 bytes already have three meaningless (always-zero) bits at the bottom, you can simply co-opt those bits for your tag, as long as you mask the tag off and dereference the actual address, rather than the tagged one.
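As a rough sketch of low-bit tagging, here is one possible encoding in C++. It assumes a three-bit tag in the bottom of a pointer-sized word, the tag numbers listed earlier, and heap objects aligned to 8 bytes; all names (Value, Tag, make_int, add and so on) are invented for this example, and a real runtime would also have to deal with integer overflow and garbage collection.

```cpp
#include <cassert>
#include <cstdint>

using Value = std::uintptr_t;                            // one tagged machine word

constexpr Value kTagBits = 3;
constexpr Value kTagMask = (Value{1} << kTagBits) - 1;   // 0b111
constexpr std::intptr_t kIntScale = std::intptr_t{1} << kTagBits;

enum Tag : Value { kInteger = 0, kCons = 1, kVector = 2 };

// alignas(8) keeps the low three address bits zero, freeing them for the tag.
struct alignas(8) Cons { Value car; Value cdr; };

Tag tag_of(Value v) { return static_cast<Tag>(v & kTagMask); }

// Integers are stored shifted left, with tag 0 in the low bits.
Value make_int(std::intptr_t n) {
    return (static_cast<Value>(n) << kTagBits) | kInteger;
}

std::intptr_t int_of(Value v) {
    assert(tag_of(v) == kInteger);
    return static_cast<std::intptr_t>(v) / kIntScale;    // low bits are zero, so exact
}

Value make_cons(Cons* p) {
    Value bits = reinterpret_cast<Value>(p);
    assert((bits & kTagMask) == 0);                      // alignment check
    return bits | kCons;
}

Cons* cons_of(Value v) {
    assert(tag_of(v) == kCons);
    return reinterpret_cast<Cons*>(v & ~kTagMask);       // strip the tag before dereferencing
}

// Checked "+": reports failure instead of jumping to an error handler.
bool add(Value a, Value b, Value& result) {
    if (tag_of(a) != kInteger || tag_of(b) != kInteger) return false;
    result = a + b;   // both tags are 0, so the sum is already correctly tagged
    return true;
}
```

Giving integers tag 0 is a deliberate choice in this sketch: addition and subtraction then work on the tagged words directly, with no untagging or retagging step.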
Does that help?
Your default solution should be a simple tagged union. If you want to narrow your typing down to more specific types, you can do it - but it won't be that "toy" any more. A thing to look at is called abstract interpretation.
There are few successful implementations of such an optimisation, with V8 being probably the most widespread. In the Scheme world, the most aggressively optimising implementation is Stalin.

Why is the new Tuple type in .Net 4.0 a reference type (class) and not a value type (struct)

Does anyone know the answer and/or have an opinion about this?
Since tuples would normally not be very large, I would assume it would make more sense to use structs than classes for these. What say you?
Microsoft made all tuple types reference types in the interests of simplicity.
I personally think this was a mistake. Tuples with more than 4 fields are very unusual and should be replaced with a more typeful alternative anyway (such as a record type in F#) so only small tuples are of practical interest. My own benchmarks showed that unboxed tuples up to 512 bytes could still be faster than boxed tuples.
Although memory efficiency is one concern, I believe the dominant issue is the overhead of the .NET garbage collector. Allocation and collection are very expensive on .NET because its garbage collector has not been very heavily optimized (e.g. compared to the JVM). Moreover, the default .NET GC (workstation) has not yet been parallelized. Consequently, parallel programs that use tuples grind to a halt as all cores contend for the shared garbage collector, destroying scalability. This is not only the dominant concern but, AFAIK, was completely neglected by Microsoft when they examined this problem.
Another concern is virtual dispatch. Reference types support subtypes and, therefore, their members are typically invoked via virtual dispatch. In contrast, value types cannot support subtypes so member invocation is entirely unambiguous and can always be performed as a direct function call. Virtual dispatch is hugely expensive on modern hardware because the CPU cannot predict where the program counter will end up. The JVM goes to great lengths to optimize virtual dispatch but .NET does not. However, .NET does provide an escape from virtual dispatch in the form of value types. So representing tuples as value types could, again, have dramatically improved performance here. For example, calling GetHashCode on a 2-tuple a million times takes 0.17s but calling it on an equivalent struct takes only 0.008s, i.e. the value type is 20× faster than the reference type.
A real situation where these performance problems with tuples commonly arise is in the use of tuples as keys in dictionaries. I actually stumbled upon this thread by following a link from the Stack Overflow question F# runs my algorithm slower than Python! where the author's F# program turned out to be slower than his Python precisely because he was using boxed tuples. Manually unboxing using a hand-written struct type makes his F# program several times faster, and faster than Python. These issues would never have arisen if tuples were represented by value types and not reference types to begin with...
The reason is most likely because only the smaller tuples would make sense as value types since they would have a small memory footprint. The larger tuples (i.e. the ones with more properties) would actually suffer in performance since they would be larger than 16 bytes.
Rather than have some tuples be value types and others be reference types and force developers to know which are which I would imagine the folks at Microsoft thought making them all reference types was simpler.
Ah, suspicions confirmed! Please see Building Tuple:
The first major decision was whether to treat tuples either as a reference or value type. Since they are immutable any time you want to change the values of a tuple, you have to create a new one. If they are reference types, this means there can be lots of garbage generated if you are changing elements in a tuple in a tight loop. F# tuples were reference types, but there was a feeling from the team that they could realize a performance improvement if two, and perhaps three, element tuples were value types instead. Some teams that had created internal tuples had used value instead of reference types, because their scenarios were very sensitive to creating lots of managed objects. They found that using a value type gave them better performance. In our first draft of the tuple specification, we kept the two-, three-, and four-element tuples as value types, with the rest being reference types. However, during a design meeting that included representatives from other languages it was decided that this "split" design would be confusing, due to the slightly different semantics between the two types. Consistency in behavior and design was determined to be of higher priority than potential performance increases. Based on this input, we changed the design so that all tuples are reference types, although we asked the F# team to do some performance investigation to see if it experienced a speedup when using a value type for some sizes of tuples. It had a good way to test this, since its compiler, written in F#, was a good example of a large program that used tuples in a variety of scenarios. In the end, the F# team found that it did not get a performance improvement when some tuples were value types instead of reference types. This made us feel better about our decision to use reference types for tuple.
If the .NET System.Tuple<...> types were defined as structs, they would not be scalable. For instance, a ternary tuple of long integers currently scales as follows:
type Tuple3 = System.Tuple<int64, int64, int64>
type Tuple33 = System.Tuple<Tuple3, Tuple3, Tuple3>
sizeof<Tuple3> // Gets 4
sizeof<Tuple33> // Gets 4
If the ternary tuple were defined as a struct, the result would be as follows (based on a test example I implemented):
sizeof<Tuple3> // Would get 32
sizeof<Tuple33> // Would get 104
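The same growth can be seen in a loose C++ analogy, offered only as an illustration. The type names are invented, and the byte counts come from plain C++ object layout, so they will not match the .NET figures quoted above. Nesting by value multiplies the size, while nesting by reference keeps the outer object at three pointers.

```cpp
#include <cstddef>
#include <cstdint>

// Value-style ternary tuple: the three fields live inline.
struct Value3 { std::int64_t a, b, c; };

// Nesting by value places all nine integers inside the outer object.
struct Value33 { Value3 a, b, c; };

// Reference-style nesting: the outer object holds only three pointers.
struct Ref33 { const Value3* a; const Value3* b; const Value3* c; };

static_assert(sizeof(Value3) == 3 * sizeof(std::int64_t), "fields stored inline");
static_assert(sizeof(Value33) == 3 * sizeof(Value3), "size triples per nesting level");
static_assert(sizeof(Ref33) == 3 * sizeof(void*), "size stays at three pointers");

// Bytes copied when each outer tuple is passed by value.
constexpr std::size_t value_copy_bytes() { return sizeof(Value33); }
constexpr std::size_t ref_copy_bytes() { return sizeof(Ref33); }
```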
As tuples have built-in syntax support in F#, and they are used extremely often in this language, "struct" tuples would pose F# programmers at risk of writing inefficient programs without even being aware of it. It would happen so easily:
let t3 = 1L, 2L, 3L
let t33 = t3, t3, t3
In my opinion, "struct" tuples would cause a high probability of creating significant inefficiencies in everyday programming. On the other hand, the currently existing "class" tuples also cause certain inefficiencies, as mentioned by @Jon. However, I think that the product of "occurrence probability" times "potential damage" would be much higher with structs than it currently is with classes. Therefore, the current implementation is the lesser evil.
Ideally, there would be both "class" tuples and "struct" tuples, both with syntactic support in F#!
Edit (2017-10-07)
Struct tuples are now fully supported as follows:
Built into mscorlib (.NET >= 4.7) as System.ValueTuple
Available as NuGet for other versions
Syntactic support in C# >= 7
Syntactic support in F# >= 4.1
For 2-tuples, you can still always use the KeyValuePair<TKey,TValue> from earlier versions of the Common Type System. It's a value type.
A minor clarification to the Matt Ellis article would be that the difference in use semantics between reference and value types is only "slight" when immutability is in effect (which, of course, would be the case here). Nevertheless, I think it would have been best in the BCL design not to introduce the confusion of having Tuple cross over to a reference type at some threshold.
I don't know, but if you have ever used F#, you will know that tuples are part of the language. If I made a .dll and returned a tuple type, it would be nice to have a type to put that in. I suspect that, now that F# is part of .NET 4, some modifications to the CLR were made to accommodate some common structures in F#.
From http://en.wikibooks.org/wiki/F_Sharp_Programming/Tuples_and_Records
let scalarMultiply (s : float) (a, b, c) = (a * s, b * s, c * s);;
val scalarMultiply : float -> float * float * float -> float * float * float
scalarMultiply 5.0 (6.0, 10.0, 20.0);;
val it : float * float * float = (30.0, 50.0, 100.0)

Why does Pascal forbid modification of the counter inside the for block?

Is it because Pascal was designed to be so, or are there any tradeoffs?
Or what are the pros and cons to forbid or not forbid modification of the counter inside a for-block? IMHO, there is little use to modify the counter inside a for-block.
Could you provide one example where we need to modify the counter inside the for-block?
It is hard to choose between wallyk's answer and cartoonfox's answer, since both answers are so nice. Cartoonfox analyses the problem from the language aspect, while wallyk analyses it from the historical and real-world aspect. Anyway, thanks for all of your answers, and I'd like to give my special thanks to wallyk.
In programming language theory (and in computability theory) WHILE and FOR loops have different theoretical properties:
a WHILE loop may never terminate (the expression could just be TRUE)
the finite number of times a FOR loop is to execute is supposed to be known before it starts executing. You're supposed to know that FOR loops always terminate.
The FOR loop present in C doesn't technically count as a FOR loop because you don't necessarily know how many times the loop will iterate before executing it. (i.e. you can hack the loop counter to run forever)
The class of problems you can solve with WHILE loops is strictly larger than the class you could solve with the strict FOR loop found in Pascal.
Pascal is designed this way so that students have two different loop constructs with different computational properties. (If you implemented FOR the C-way, the FOR loop would just be an alternative syntax for while...)
In strictly theoretical terms, you shouldn't ever need to modify the counter within a for loop. If you could get away with it, you'd just have an alternative syntax for a WHILE loop.
You can find out more about "while loop computability" and "for loop computability" in these CS lecture notes: http://www-compsci.swan.ac.uk/~csjvt/JVTTeaching/TPL.html
Another such property, by the way, is that the loop variable is undefined after the for loop. This also makes optimization easier.
Pascal was first implemented for the CDC Cyber—a 1960s and 1970s mainframe—which like many CPUs today, had excellent sequential instruction execution performance, but also a significant performance penalty for branches. This and other characteristics of the Cyber architecture probably heavily influenced Pascal's design of for loops.
The Short Answer is that allowing assignment to a loop variable would require extra guard code and would mess up optimization of loop variables, which could ordinarily be handled well in 18-bit index registers. In those days, software performance was highly valued due to the expense of the hardware and the inability to speed it up any other way.
Long Answer
The Control Data Corporation 6600 family, which includes the Cyber, is a RISC architecture using 60-bit central memory words referenced by 18-bit addresses. Some models had an (expensive, therefore uncommon) option, the Compare-Move Unit (CMU), for directly addressing 6-bit character fields, but otherwise there was no support for "bytes" of any sort. Since the CMU could not be counted on in general, most Cyber code was generated for its absence. Ten characters per word was the usual data format until support for lowercase characters gave way to a tentative 12-bit character representation.
Instructions are 15 bits or 30 bits long, except for the CMU instructions being effectively 60 bits long. So up to 4 instructions packed into each word, or two 30 bit, or a pair of 15 bit and one 30 bit. 30 bit instructions cannot span words. Since branch destinations may only reference words, jump targets are word-aligned.
The architecture has no stack. In fact, the procedure call instruction RJ is intrinsically non-re-entrant. RJ modifies the first word of the called procedure by writing a jump to the next instruction after where the RJ instruction is. Called procedures return to the caller by jumping to their beginning, which is reserved for return linkage. Procedures begin at the second word. To implement recursion, most compilers made use of a helper function.
The register file has eight instances each of three kinds of register, A0..A7 for address manipulation, B0..B7 for indexing, and X0..X7 for general arithmetic. A and B registers are 18 bits; X registers are 60 bits. Setting A1 through A5 has the side effect of loading the corresponding X1 through X5 register with the contents of the loaded address. Setting A6 or A7 writes the corresponding X6 or X7 contents to the address loaded into the A register. A0 and X0 are not connected. The B registers can be used in virtually every instruction as a value to add or subtract from any other A, B, or X register. Hence they are great for small counters.
For efficient code, a B register is used for loop variables since direct comparison instructions can be used on them (B2 < 100, etc.); comparisons with X registers are limited to relations to zero, so comparing an X register to 100, say, requires subtracting 100 and testing the result for less than zero, etc. If an assignment to the loop variable were allowed, a 60-bit value would have to be range-checked before assignment to the B register. This is a real hassle. Herr Wirth probably figured that both the hassle and the inefficiency weren't worth the utility--the programmer can always use a while or repeat...until loop in that situation.
Additional weirdness
Several unique-to-Pascal language features relate directly to aspects of the Cyber:
the packed keyword: either a single "character" consumes a 60-bit word, or it is packed ten characters per word.
the (unusual) alfa type: packed array [1..10] of char
intrinsic procedures pack() and unpack() to deal with packed characters. These perform no transformation on modern architectures, only type conversion.
the weirdness of text files vs. file of char
no explicit newline character. Record management was explicitly invoked with writeln
While set of char was very useful on CDCs, it was unsupported on many subsequent 8 bit machines due to its excess memory use (32-byte variables/constants for 8-bit ASCII). In contrast, a single Cyber word could manage the native 62-character set by omitting newline and something else.
full expression evaluation (versus shortcuts). These were implemented not by jumping and setting one or zero (as most code generators do today), but by using CPU instructions implementing Boolean arithmetic.
Pascal was originally designed as a teaching language to encourage block-structured programming. Kernighan (the K of K&R) wrote an (understandably biased) essay on Pascal's limitations, Why Pascal is Not My Favorite Programming Language.
The prohibition on modifying what Pascal calls the control variable of a for loop, combined with the lack of a break statement means that it is possible to know how many times the loop body is executed without studying its contents.
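To make the contrast concrete, here is a small C++ sketch; pascal_for and c_style_calls are hypothetical names, not anything from Pascal or a library. The first function imitates the Pascal rule by evaluating the bounds once and handing the body only a copy of the index, so the trip count is fixed before the first iteration. The second shows a C-style loop whose body changes the counter, so the number of iterations depends on what the body does.

```cpp
#include <cassert>
#include <functional>

// Pascal-style FOR: the bounds are evaluated once and the body receives the
// index by value, so nothing the body does can change the iteration.
// Returns the trip count, which is known before the loop starts.
long pascal_for(long first, long last, const std::function<void(long)>& body) {
    const long trips = last >= first ? last - first + 1 : 0;
    for (long n = 0; n < trips; ++n)
        body(first + n);   // a copy: assigning to it inside body has no effect here
    return trips;
}

// C-style loop whose body skips ahead by modifying the counter. The number
// of body executions can no longer be read off the loop header.
long c_style_calls(long first, long last) {
    long calls = 0;
    for (long i = first; i <= last; ++i) {
        ++calls;
        if (i % 2 == 0) ++i;   // the body changes the counter
    }
    return calls;
}
```

For the range 1 to 10, pascal_for always runs its body ten times whatever the body contains, whereas working out how often c_style_calls executes its body requires reading the body itself.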
The lack of a break statement, together with not being able to use the control variable after the loop terminates, is more of a restriction than not being able to modify the control variable inside the loop, as it prevents some string and array processing algorithms from being written in the "obvious" way.
These and other differences between Pascal and C reflect the different philosophies with which they were first designed: Pascal to enforce a concept of "correct" design, C to permit more or less anything, no matter how dangerous.
(Note: Delphi does have a Break statement however, as well as Continue, and Exit which is like return in C.)
Clearly we never need to be able to modify the control variable in a for loop, because we can always rewrite using a while loop. An example in C where such behaviour is used can be found in K&R section 7.3, where a simple version of printf() is introduced. The code that handles '%' sequences within a format string fmt is:
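for (p = fmt; *p; p++) {
    if (*p != '%') {
        putchar(*p); /* ordinary character: copy it to the output */
        continue;
    }
    switch (*++p) { /* advance the loop variable past the '%' */
    case 'd':
        /* handle integers */
        break;
    case 'f':
        /* handle floats */
        break;
    case 's':
        /* handle strings */
        break;
    }
}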
Although this uses a pointer as the loop variable, it could equally have been written with an integer index into the string:
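for (i = 0; i < strlen(fmt); i++) {
    if (fmt[i] != '%') {
        putchar(fmt[i]); /* ordinary character: copy it to the output */
        continue;
    }
    switch (fmt[++i]) { /* advance the loop variable past the '%' */
    case 'd':
        /* handle integers */
        break;
    case 'f':
        /* handle floats */
        break;
    case 's':
        /* handle strings */
        break;
    }
}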
It can make some optimizations (loop unrolling for instance) easier: no need for complicated static analysis to determine if the loop behavior is predictable or not.
From For loop
In some languages (not C or C++) the loop variable is immutable within the scope of the loop body, with any attempt to modify its value being regarded as a semantic error. Such modifications are sometimes a consequence of a programmer error, which can be very difficult to identify once made. However only overt changes are likely to be detected by the compiler. Situations where the address of the loop variable is passed as an argument to a subroutine make it very difficult to check, because the routine's behaviour is in general unknowable to the compiler.
So this seems to be to help you not burn your hand later on.
Disclaimer: It has been decades since I last did PASCAL, so my syntax may not be exactly correct.
You have to remember that PASCAL is Niklaus Wirth's child, and Wirth cared very strongly about reliability and understandability when he designed PASCAL (and all of its successors).
Consider a code fragment in which a FOR loop steps its index I over a fixed range and the loop body simply calls procedure FOO(I).
Without looking at procedure FOO, answer these questions: Does this loop ever end? How do you know? How many times is procedure FOO called in the loop? How do you know?
PASCAL forbids modifying the index variable in the loop body so that it is POSSIBLE to know the answers to those questions, and know that the answers won't change when and if procedure FOO changes.
It's probably safe to conclude that Pascal was designed to prevent modification of a for loop index inside the loop. It's worth noting that Pascal is by no means the only language which prevents programmers from doing this; Fortran is another example.
There are two compelling reasons for designing a language that way:
Programs, specifically the for loops in them, are easier to understand and therefore easier to write and to modify and to verify.
Loops are easier to optimise if the compiler knows that the trip count through a loop is established before entry to the loop and invariant thereafter.
For many algorithms this behaviour is the required behaviour; updating all the elements in an array, for example. If memory serves, Pascal also provides while-do loops and repeat-until loops. Most algorithms, I guess, which are implemented in C-style languages with modifications to the loop index variable or breaks out of the loop could just as easily be implemented with these alternative forms of loop.
I've scratched my head and failed to find a compelling reason for allowing the modification of a loop index variable inside the loop, but then I've always regarded doing so as bad design, and the selection of the right loop construct as an element of good design.
