I'm creating a JSON-RPC library and I need to know what characters are allowed as Haxe method names. I know that I must be cautious of allowed method names in the target languages as well, but I'm trying to start with Haxe : )
A letter or underscore for the first character, and alphanumeric characters or underscores for the rest.
[a-zA-Z_][a-zA-Z0-9_]*
is the regexp for a Haxe method name.
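As a minimal sketch of how a JSON-RPC layer might apply that pattern, here is a small validator written in Python. The function name is_valid_method_name is hypothetical, and the check covers only the character rule from the answer above; it does not reject Haxe reserved words or names that are illegal in a particular target language.

```python
import re

# Pattern copied from the answer above. fullmatch() requires the whole
# string to match, so no explicit ^ and $ anchors are needed.
HAXE_METHOD_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_method_name(name: str) -> bool:
    """Return True if name satisfies the identifier pattern in full.

    Assumption: only the character rule is checked; reserved words
    are not filtered out.
    """
    return HAXE_METHOD_NAME.fullmatch(name) is not None
```

A dispatcher could call this on every incoming method name and answer with a "method not found" error when it returns False.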
Isn’t list a keyword to create a new list in Lisp, but yet it is possible to have an argument called list in Lisp. I thought keywords in most programming languages such as Java or C++ cannot be used for argument names, is there a special reason in Lisp that they can?
The name list isn't a reserved keyword, it's an ordinary function. Reusing the name for another purpose can be confusing for the reader but doesn't present any problems for the language itself; it's the same as having two variables called x in different parts of the program.
Mainstream Lisp descendants and derivatives like Common Lisp and Scheme do not incorporate the concept of reserved keywords. It is alien to the way Lisp works.
When Lisp read syntax is scanned, identifier tokens which appear in it are converted into corresponding symbol objects. These tokens are all in the same lexical category: symbol.
When Lisp read syntax is scanned and turned into an object, such as a nested list representing program code, this is done without regard for the semantics (what the symbols mean).
This is different from the parsing of languages (such as some of those in the broad Fortran/Algol family) which have reserved keywords.
Roughly speaking, reserved keywords are tokens which look like symbols but are actually just punctuation. Lisp has punctuation also, like parentheses, sharpsign prefixes, various quotes and such.
These punctuation words have a fixed role in the phrase structure grammar, and the phrase structure grammar must be processed before the semantics of the program can be considered.
So for instance, the reserved BEGIN and END keywords in Pascal are essentially nothing more than verbose parentheses. The '(' and ')' tokens are similarly reserved in Lisp-like languages. Trying to use BEGIN as the name of a function or variable in Pascal is similar to trying to use ( as the name of a function or variable in Lisp.
Some languages have keywords which determine phrase structure, yet allow identifiers which look exactly like reserved keywords to be used anyway. For instance, PL/I was famous for this:
IF IF=THEN THEN THEN=ELSE; ELSE ELSE=IF
Lisp dialects may assign special semantic treatment to certain symbols or certain categories of symbols. This is a sort of reservation, but not exactly the same as reserved keywords, because it is at the semantic level. For instance, in Common Lisp, the symbols nil and t (more specifically the nil and t in the common-lisp package, common-lisp:nil and common-lisp:t) may not be used as function or variable names. When either one appears as an expression, it evaluates to itself: the value of t is t and that of nil is nil. Moreover, nil is also the Boolean false value and the empty list. So, effectively, these symbols are reserved in some regards. Common Lisp also has a keyword package. All symbols in that package evaluate to themselves and may not be used as variables. They may be used as function names, and for any other purpose.
You say Lisp, but the answer changes depending on which Lisp you're talking about.
In Common Lisp, you can use list as a variable because Common Lisp is a Lisp-2, meaning that each symbol has a separate slot for a function binding and a variable binding. Common Lisp sets the function binding for the symbol list in the CL package, but doesn't set the variable binding. You can't change the function binding because Common Lisp doesn't allow you to redefine bindings for symbols that are set in the CL package (you can, of course, use whatever symbols you like in your own packages), but since the variable binding is free you're allowed to use it.
Scheme is a Lisp-1, which means that it only has one binding per symbol. There's no separation of function bindings and variable bindings (hence why you use define in Scheme, but defun and defvar in CL). The reason you can use "list" as a variable is because Scheme doesn't prevent you from rebinding its built-in symbols. It's just generally a bad idea, since by redefining list you can no longer call the list function.
Emacs Lisp is a Lisp-2 but doesn't prevent you from rebinding symbols, which means you can do things like (defun + (a b) (- a b)) and totally screw up your editing session. So... don't do that, unless you really know what you're doing.
Clojure is a Lisp-1. I don't have a working Clojure install at the moment so I can't comment on what it lets you do. I would suspect it's more strict than Scheme.
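The single-namespace behaviour described for Scheme above has a rough analogue in Python, where list is also an ordinary built-in name rather than a reserved word. The sketch below is an analogy only, not Lisp code; the function names wrap and shadowed are made up for illustration.

```python
def wrap(list):
    # Inside this function `list` is just a parameter name; the
    # built-in with the same name is shadowed, not broken.
    return [list]


def shadowed():
    # Rebinding the name locally hides the built-in in this scope,
    # much like redefining list in a Lisp-1.
    list = (1, 2, 3)
    try:
        list("abc")  # attempts to call the tuple, not the built-in
    except TypeError:
        return "built-in list is unreachable here"
    return "built-in list was still callable"
```

Outside those two functions the built-in list is untouched, which mirrors the point that reusing the name confuses readers more than it confuses the language.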
I am curious about parsing C++ code using regexp. What I have so far (using ruby) allows me to extract class declarations and their parent classes (if any):
/(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\s*\{/
Here is an example in Rubular. Notice I can capture correctly the "declaration" and "inheritance" parts.
The point where I am stuck is capturing the class body. If I use the following extension of the original regex:
/(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\s*\{[^}]*\};/
Then I can capture the class body only if it does not contain any curly braces, and therefore any class or function definition.
At this point I have tried many things but none of them make this better.
For instance, if I include in the regexp the fact that the body can contain braces, it will capture the first class declaration and then all the subsequent classes as if they were part of the first class' body!
What am I missing?
Regular expressions are not the recommended way to parse code.
Most compilers and interpreters use lexers and parsers to convert code into an abstract syntax tree before compiling or running the code.
Ruby has a few lexer gems, like this, you can try and incorporate into your project.
The group capturing might help:
# (?<match>...) is the named group; \g<match> calls it recursively
/(struct|class)\s+(?<match>{((\g<match>|[^{}]*))*})/m
Here we find the matching curly bracket for the one following the struct/class declaration. You will probably want to tune the regexp; I posted it in this form to make the solution as clear as possible.
What I can offer you is this:
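(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\{([^{}]|\{\g<4>*\})*\};
Where \g<4> is a recursive application of the fourth capture group, which is ([^{}]|\{\g<4>*\}). The * after \g<4> lets a nested pair of braces contain any number of characters or further nested blocks.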
Matching non-regular languages with regular expressions is never pretty. You might want to consider switching to a proper recursive descent parser, especially if you plan to do something with the stuff you just captured.
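Short of a full recursive descent parser, the nesting problem can be handled with a plain depth counter. The following is a minimal sketch in Python, not a C++ parser: the function name extract_braced_body is hypothetical, and the code assumes that no braces appear inside string literals, character literals, or comments.

```python
def extract_braced_body(source: str, open_index: int) -> str:
    """Return the text between the '{' at open_index and its matching '}'.

    Sketch only: braces inside string literals, character literals,
    and comments are counted like any other brace, which a real C++
    tokenizer would have to avoid.
    """
    if source[open_index] != "{":
        raise ValueError("open_index must point at '{'")
    depth = 0
    for i in range(open_index, len(source)):
        ch = source[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close found; return the body without the braces.
                return source[open_index + 1 : i]
    raise ValueError("unbalanced braces")
```

One way to combine the two approaches is to let the simpler regex find the declaration and the position of the opening brace, then hand that position to the counter to get the body.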
In some gcc code I came across the following construct.
fatal (_("%s: cannot find section %s"), file_name, section_name);
I have never seen "_" in this context.
It is some sort of construct to create an entity from the character array, very probably a compiler extension.
Can someone tell me what it is?
It is usually a macro associated with the GNU gettext project, used for internationalization. The idea is that the passed string is a key in a lookup table. There is one such table for each supported language, with the current one decided by a handful of environmental factors.
The value found in the table should be a translation of the key, into the target language.
Since looking up such translated strings is a common activity in i18n code, _ is introduced as a convenient, short name for the lookup function.
If I have an identifier with the same name as an existing keyword, how do I escape it?
That's what I found (and this is probably the final answer):
It is possible to use @ as a prefix in identifier names. However, by default it creates a different identifier (@a != a).
Since @ is allowed, it is possible to add a new compiler step to the pipeline that will do TrimStart('@') on all identifiers. It works OK; you just have to remember to handle every kind of thing that has a name.
If you are using Rhino.DSL, it has a UseSymbols step that converts @a into 'a', which had confused me a lot (I was working with a project that included this step by default).
I don't think anything like the C# @ prefix is implemented in Boo... but I'm pretty sure it could be achieved by inserting a custom compiler step at the beginning of the compiler pipeline.
The use of symbol literals is not immediately clear from what I've read up on Scala. Would anyone care to share some real world uses?
Is there a particular Java idiom being covered by symbol literals? What languages have similar constructs? I'm coming from a Python background and not sure there's anything analogous in that language.
What would motivate me to use 'HelloWorld vs "HelloWorld"?
Thanks
In Java terms, symbols are interned strings. This means, for example, that reference equality comparison (eq in Scala and == in Java) gives the same result as normal equality comparison (== in Scala and equals in Java): 'abcd eq 'abcd will return true, while "abcd" eq "abcd" might not, depending on JVM's whims (well, it should for literals, but not for strings created dynamically in general).
Other languages which use symbols are Lisp (which uses 'abcd like Scala), Ruby (:abcd), Erlang and Prolog (abcd; they are called atoms instead of symbols).
I would use a symbol when I don't care about the structure of a string and use it purely as a name for something. For example, if I have a database table representing CDs, which includes a column named "price", I don't care that the second character in "price" is "r", or about concatenating column names; so a database library in Scala could reasonably use symbols for table and column names.
If you have plain strings representing, say, method names in code, that perhaps get passed around, you're not quite conveying things appropriately. This is sort of the Data/Code boundary issue: it's not always easy to draw the line, but if we were to say that in that example those method names are more code than they are data, then we want something to clearly identify that.
A Symbol Literal comes into play where it clearly differentiates just any old string data from a construct being used in the code. It's really there for when you want to indicate that this isn't just some string data, but is in some way part of the code. The idea is that things like your IDE would highlight it differently, and, given the tooling, you could refactor on those rather than doing text search/replace.
This link discusses it fairly well.
Note: Symbol literals were deprecated in Scala 2.13 and are no longer supported in Scala 3 (Dotty).
Reference: http://dotty.epfl.ch/docs/reference/dropped-features/symlits.html
Because of this, I personally recommend not using Symbols anymore (at least in new Scala code). As the Dotty documentation states:
Symbol literals are no longer supported
it is recommended to use a plain string literal [...] instead
Python maintains an internal global table of "interned strings" with the names of all variables, functions, modules, etc. With this table, the interpreter can make lookups faster and apply optimizations. You can force this process with the intern function (sys.intern in Python 3).
Java and Scala also intern some strings automatically (string literals, for example) for faster lookups. In Scala, you can call the intern method to intern a string explicitly, but not every string is interned automatically. Symbols benefit from being guaranteed to be interned, so a single reference equality check is sufficient to prove either equality or inequality.
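To make the interning point concrete, here is a small Python 3 sketch using sys.intern. The helper build and the variable names are made up for this example. Two strings assembled at run time compare equal by value, and once both are passed through sys.intern they are guaranteed to be the same object, which is the property symbols give you for free.

```python
import sys


def build(parts):
    # Joining at run time, so the result is not a compile-time literal.
    return "".join(parts)


a = build(["pri", "ce"])
b = build(["pri", "ce"])

# Value equality compares the characters.
equal_by_value = a == b

# sys.intern returns the single table entry for that text, so both
# calls hand back the same object and an identity check is enough.
interned_a = sys.intern(a)
interned_b = sys.intern(b)
same_object = interned_a is interned_b
```

Whether a and b themselves are the same object before interning is an implementation detail, which is why code should not rely on identity checks for ordinary strings.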