One more difference between gcc's and MS preprocessor - visual-studio

One more difference between gcc preprocessor and that of MS VS cl. Consider the following snippet:
# define A(x) L ## x
# define B A("b")
# define C(x) x
C(A("a" B))
For 'gcc -E' we get the following:
L"a" A("b")
For 'cl /E' the output is different:
L"a" L"b"
MS preprocessor somehow performs an additional macro expansion. Algorithm of its work is obviously different from that of gcc, but this algorithm also seems to be a secret. Does anyone know how the observed difference can be explained and what is the scheme of preprocessing in MS cl?

GCC is correct. The standard specifies:
C99 6.10.3.4/2 (and also C++98/11 16.3.4/2): If the name of the macro being replaced is found during this scan of the replacement list
(not including the rest of the source file’s preprocessing tokens), it is not replaced.
So, when expanding A("a" B), we first replace B to give A("a" A("B")).
A("B") is not replaced, according to the quoted rule, so the final result is L"a" A("B").

Mike's answer is correct, but he actually elides the critical part of the standard that shows why this is so:
6.10.3.4/2 If the name of the macro being replaced is found during this scan of the replacement list
(not including the rest of the source file’s preprocessing tokens), it is not replaced.
Furthermore, if any nested replacements encounter the name of the macro being replaced,
it is not replaced. These nonreplaced macro name preprocessing tokens are no longer
available for further replacement even if they are later (re)examined in contexts in which
that macro name preprocessing token would otherwise have been replaced.
Note the last clause here that I've emphasized.
So both gcc and MSVC expand the macro A("a" B) to L"a" A("b"), but the interesting case (where MSVC screws up) is when the macro is wrapped by the C macro.
When expanding the C macro, its argument is first examined for macros to expand and A is expanded. This is then substituted into the body of C, and then that body is then scanned AGAIN for macros to replace. Now you might think that since this is the expansion of C, only the name C will be skipped, but this last clause means that the tokens from the expansion of A will also skip reexpansions of A.

There are basically two ways of how one could think that the remaining occurrance of the A macro should be replaced:
The first would be the processing or macro arguments before they are inserted in place of the corresponding parameter in the macro's replacement list. Usually each argument is complete macro-replaced as if it formed the rest of the input file, as decribed in section 6.10.3.1
of the standard. However, this is not done if the parameter (here: x) occurs next to the ##; in this case the parameter is simply replaced with the argument according to 6.10.3.3, without any recursive macro replacement.
The second way would be the "rescanning and further replacement" of section 6.10.3.4, but this not done recursively for a macro that has already been replaced once.
So neither applies in this case, which means that gcc is correct in leaving that occurrence of A unreplaced.

Related

Macro contains a cycle

So I'm trying to make a lexical analyzer for scheme and when I run JFlex to convert the lever.flex file I get an error similar to this one for example:
Reading "lexer.flex"
Macro definition contains a cycle.
1 error, 0 warnings.
the macro it's referring to is this one:
definition = {variable_definition}
| {syntax_definition}
| \(begin {definition}*\)
| \(let-syntax \({syntax_binding}*\){definition}*\)
| \(letrec-syntax \({syntax_binding}*\){definition}*\)
all of the macros defined here have been implemented but fro some reason I can't get rid of this error and I don't know why its happening.
A lex/flex/JFlex style "definition" is a macro expansion, as that error message indicates. Recursive macro expansions are impossible, since macro expansion is not conditional; trying to expand
definition = ... \(begin {definition}*\) ...
will result in an infinitely long regular expression.
Do not mistake a lexical analyser for a general-purpose parser. A lexical analyser does no more than split an input into individual tokens (or "lexemes"), using regular expressions to identify each token. Tokens do not have structure (at least for the purposes of parsing); once a token is identified, it is a single indivisible object. If you find yourself writing lexical descriptions which match structured text, you have almost certainly pushed the lexical analysis beyond its limits.
Parsers use an algorithm which does allow recursive descriptions (but which has very limited forward lookahead) and which can create a recursive description of the input (such as a parse tree).

What are valid identifiers in R7RS-small?

R7RS-small says that all identifiers must be terminated by a delimiter, but at the same time it defines pretty elaborate rules for what can be in an identifier. So, which one is it?
Is an identifier supposed to start with an initial character and then continue until a delimiter, or does it start with an initial character and continue following the syntax defined in 7.1.1.
Here are a couple of obvious cases. Are these valid identifiers?
a#a
b,b
c'c
d[d]
If they are not supposed to be valid, what is the purpose of saying that an identifier must be terminated by a delimiter?
|..ident..| are delimiters for symbols in R7RS, to allow any character that you cannot insert in an old style symbol (| is the delimiter).
However, in R6RS the "official" grammar was incorrect, as it did not allow to define symbols such that 1+, which led all implementations define their own rules to overcome this illness of the official grammar.
Unless you need to read the source code of a given implementation and see how it defines the symbols, you should not care too much about these rules and use classical symbols.
In the section 7.1.1 you find the backus-naur form that defines the lexical structure of R7RS identifiers but I doubt the implementations follow it.
I quote from here
As with identifiers, different implementations of Scheme use slightly
different rules, but it is always the case that a sequence of
characters that contains no special characters and begins with a
character that cannot begin a number is taken to be a symbol
In other words, an implementation will use a function like read-atom and after that it will classify an atom by backtracking with read-number and if number? fails it will be a symbol.

Gnu make with multiple % in pattern rule?

If I understand the make documentation correctly I can only use character % once in my target-pattern!
I am facing a condition which my target is look like:
$(Home)/ecc_%_1st_%.gff: %.fpp
And whenever I am trying to match /home/ecc_map_1st_map.gff: map.fpp
Make complains that can't find a pattern.
I know it is because of using % twice ($(Home)/ecc_map_1st_%.gff: %.fpp runes perfectly)
Is there anyway which I can use two % in one target!?
Thanks
You can't have two patterns. The best way would be not having map twice in the target name.
If you must, and you're using GNU make, secondary expansion might be useful:
.SECONDEXPANSION:
$(Home)/ecc_%.gff: $$(word 1,$$(subst _, ,$$*)).fpp
Normally you can't use automatic variables such as $* in prerequisites, but secondary expansion allows you to do just that. Note that I used $$ in place of $ so that the real expansion happens in the second pass and not in the first one.
Say you are making $(Home)/ecc_map_1st_map.gff. The $* variable is going to be the stem, i.e. map_1st_map. Then $(subst _, ,...) replaces underscores with spaces, to obtain map 1st map. Now $(word N,...) extracts the N-th space-separated word in there. First word would be the first map, second would be 1st, third would be the second map. I'm picking the first one.
Of course this is pretty lax and doesn't guarantee you'll match exactly the targets you wanted ($(Home)/ecc_map_2nd_notmap__foo.gff would match and give map.fpp). You could do further checks (1st in the middle, other two must match) and bail out:
extract-name = $(call extract-name-helper,$1,$(word 1,$(subst _, ,$1)))
extract-name-helper = $(if $(filter $2_1st_$2,$1),$2,$(error Invalid .gff target))
.SECONDEXPANSION:
ecc_%.gff: $$(call extract-name,$$*).fpp
But that doesn't look like a sensible solution.
Wrapping up, if don't have similarly named files, then the first solution would work well. Also, if you know the target names beforehand, you could make it safer with an explicit pattern rule (or even generate them on the fly).

Extract function names from function calls in C files

Is it posible to extract function calls in C source files, e.g.,
...
myfunc(1);
...
or
...
myfunc(anotherfunc(1, 2));
....
by just using Ruby regular expression? If not, would a parser generator such as ANTLR be useful?
This is not a full-proof pattern for finding out method calls but should just serve the pattern that you are interested in.
[a-zA-Z\s]*\([a-zA-Z0-9]*(\([a-zA-Z0-9\s]*[\s,]*[\sa-zA-Z0-9]*\))?\);
This regex will match following method call patterns.
1. myfunc(another(one,two));
2. myfunc();
3. myfunc(another());
4. myfunc(oneArg);
You can also use the regular expressions already written from grammar that are used by emacs -- imenu , etags, ecb, c-mode etc.
In the purest sense you can't, because the possibility to nest function calls recursively makes it a non-regular language. That is, you cannot write a regular expression that matches an arbitrary function call and extracts all of the contained function names.
But of course you could search incrementally for sequences of characters allowed in function names (ie., must start with a letter or underscore, followed by letters, underscore, numbers, etc...) followed by an left parenthesis, or something along those lines.
Keep in mind, however, that any such approach is prone to errors: what if a function is referenced in a comment? What if it appears inside a string constant? Really, to catch all the special cases you would have to (almost) properly parse the full C file.
Most modern regular expression engines have features to parse more than regular languages e.g. by means of back-references to subexpressions. But you shouldn't go down that road. With a proper parser such as ANTLR that can parse context-free languages you'll make your own life a lot easier.

User-defined Literals suffix, with *_digit..."?

A user-defined literal suffix in C++0x should be an identifier that
starts with _ (underscore) (17.6.4.3.5)
should not begin with _ followed by uppercase letter (17.6.4.3.2)
Each name that [...] begins with an underscore followed by an uppercase letter is reserved to the implementation for any use.
Is there any reason, why such a suffix may not start _ followed by a digit? I.E. _4 or _3musketeers?
Musketeer dartagnan = "d'Artagnan"_3musketeers;
int num = 123123_4; // to be interpreted in base4 system?
string s = "gdDadndJdOhsl2"_64; // base64decoder
The precedent for identifiers of the form _<number> is the function argument placeholder object mechanism in std::placeholders (§20.8.9.1.3), which defines an implementation-defined number of such symbols.
This is a good thing, because it means the user cannot #define any identifier of that form. §17.6.4.3.1/1:
A translation unit that includes a standard library header shall not #define or #undef names declared in any standard library header.
The name of the user-defined literal function is operator "" _123, not simply _123, so there is no direct conflict between your name and the library name if presence of the using namespace std::placeholders;.
My 2¢, though, is that you would be better off with an operator "" _baseconv and encoding the base within the literal, "123123_4"_baseconv.
Edit: Looking at Johannes' (deleted) answer, there is There may be concern that _123 could be used as a macro by the implementation. This is certainly the realm of theory, as the implementation would have little to gain by such preprocessor use. Furthermore, if I'm not mistaken, the reason for hiding these symbols in std::placeholders, not std itself, is that such names are more likely to be used by the user, such as by inclusion of Boost Bind (which does not hide them inside a named namespace).
The tokens are not reserved for use by the implementation globally (17.6.4.3.2), and there is precedent for their use, so they are at least as safe as, say, forward.
"can" vs "may".
can denotes ability where may denotes permission.
Is there a reason why you would not have permission to the start a user-defined literal suffix with _ followed by a digit?
Permission implies coding standards or best-practices. The examples you provides seem to show that _\d would fine suffixes if used correctly (to denote numeric base). Unfortunately your question can't have a well thought out answer as no one has experience with this new language feature yet.
Just to be clear user-defined literal suffixes can start with _\d.
An underscore followed by a digit is a legal user-defined literal suffix.
The function signature would be:
operator"" _4();
so it couldn;t get eaten by a placeholder.
The literal would be a single preprocessor token:
123123_4;
so the _4 would not get clobbered by a placeholder or a preprocessor symbol.
My reading of 17.6.4.3.5 is that suffixes not containing a leading underscore risk collision with the implementation or future library additions. They also collide with existing suffixes: F, L, ULL, etc. One of the rationales for user-defined literals is that a new type (such as decimals for example) could be defined as a pure library extension including literals with suffuxes d, df, dl.
Then there's the question of style and readability. Personally, I think I would loose sight of the suffix 1234_3; Maybe, maybe not.
Finally, there was some idea that didn't make it into the standard (but I kind of like) to have _ be a literal separator for numbers like in Ada and Ruby. So you could have 123_456_789 to visually separate thousands for example. Your suffix would break if that ever went through.
I knew I had some papers on this subject:
Digital Separators describes a proposal to use _ as a digit separator in numeric literals
Ambiguity and Insecurity with User-Defined literals Describes the evolution of ideas about literal suffix naming and namespace reservation and efforts to deconflict user-defined literals against a future digit separator.
It just doesn't look that good for the _ digit separator.
I had an idea though: how about either a backslash or a backtick for digit separator? It isn't as nice as _ but I don't think there would be any collision as long as the backslash was inside the stream of digits. The backtick has no lexical use currently that I know of.
i = 123\456\789;
j = 0xface\beef;
or
i = 123`456`789;
j = 0xface`beef;
This would leave _123 as a literal suffix.

Resources