Adding keyword commands and functions to Textmate 2 bundle syntax - textmate

I want to add some additional syntax highlighting definitions to an existing bundle, but I need some general advice on how to do this. I'm not building a syntax from scratch, and I think my request is pretty simple, but I think it involves some subtleties for which I find the manual somewhat impenetrable in finding the answer.
Basically, I'm trying to fill out the syntax definitions for the Stata bundle. It's great, but there is no built in support for automatically highlighting the base commands and the installed functions, only a handful of basic control statements. Stata is a language which is primarily used by calling lots of different high level pre-defined command calls, like command foo bar, options(). The convention is that these command calls be highlighted.
There are a ton of these commands, and stubs which are used for convenience. Just the base install has almost 3500. Even optimizing them using the bundle helper, which obviously gets rid of the stub issue, still yields a massive regex list. I can easily cut this down to less than 1000 important ones, but its still a lot. There are also 350 "functions" which I would like to match with the syntax function()
I essentially have 3 questions:
Am I creating a serious problem by including a very comprehensive list of matching definitions?
How do I restrict a command to only highlight when it either begins a line or there is only whitespace between the beginning line and the command
What is the preferred way of restricting the list of functions() to only highlight when they have attached parentheses?

Related

How can one get a list of Mathematica's built-in global rewrite rules?

I understand that over a thousand built-in rewrite rules in Mathematica populate the global rules table by default. Is there any way to get Mathematica to give a full or even partial list of those rules?
The best way is to get a job at Wolfram Research.
Failing that, I think that for things not completely compiled into the kernel you can recover most of the rules/definitions. Look at
Attributes[fn]
where fn is the command that you're interested in. If it returns
{Protected, ReadProtected}
then there's something you can get a look at (although often it's just a MakeBoxes (formatting) definition or a AutoLoad/Stub type definition). To see what's there run
Unprotect[fn];
ClearAttributes[fn, ReadProtected];
??fn
Quite often you'll have to run an example of the command to load it if it was a stub. You'll also have to dig down from the user-facing commands to the back-end implementations.
Eventually you'll most likely reach a core command that is compiled into the kernel that you can not see the details of.
I previously mentioned this in tips for creating Graph diagrams and it got a mention in What is in your Mathematica tool bag?.
An good example, with a nice bite-sized and digestible bit of code is Experimental`AngularSlider[] mentioned in Circular/Angular slider. I'll leave it up to you to look at the code produced.
Another example is something like BoxWhiskerChart, where you need to call it once in order to load all of the code. Then you see that BoxWhiskerChart proceeds to call Charting`iBoxWhiskerChart which you'll have to unprotect to look at, etc...

Using strings instead of symbols: good or evil?

Often enough, I find myself dealing with lists of function options (or more general replacement lists) of the form {foo->value,...}. This leads to bugs when foo already has a value in $Context. One obvious way to prevent this is using a string "foo" instead of the symbol: {"foo"->value,...}. This works, but seems to draw ire of some seasoned LISPers I know, who chastise me for conflating symbols and strings and tell me to use built-in quoting constructs.
While it is certainly possible to write code that avoids collisions without using strings, it often seems more trouble than it is worth. On the other hand, I haven't seen too many examples of {"string"->value} type replacement rules. So the question to you is -- is this an acceptable usage pattern?.. Are there cases where it is particularly appropriate?.. Where should it be avoided?..
In my opinion (disclaimer - it is only my opinion), it is best to avoid using strings as option names, at least for "main" options in your function. Strings OTOH are totally fine as settings (r.h.s. of options). This is not to say that you can not use strings, just as you noted. Perhaps, they could be more appropriate for sub-options, and they are used in this way by many system functions (usually "superfunctions" like NDSolve, that may have sub-options within options). The main problems I see with using strings is that they reduce the introspection capabilities, both for the system and for the user. In other words, it is harder to discover an option that has a string name than that with a symbol name - for the latter I can just inspect the names of the symbols in a package, and also symbolic option names have usage messages. You may also want to automate some things, such as writing a utility that finds all option names in the package etc. It is easier to do when option names are symbols, since they all belong to the same context. It is also easy to discover that some options do not have usage messages, one can do that automatically by writing a utility function.
Finally, you may have a better protection against accidental collisions of similar option names. It may be, that many option sequences are passed to your function, and occasionally they may contain options with the same name. If option names were symbols, full symbol names would be different. Then, you will both get a shadowing warning, and at the same time a protection - only the correct option (full) name will be used. For string, you don't get any warning, and may end up using incorrect option setting, if the duplicate string option name with a wrong setting (intended for a different function, say) happens to be first in the list. This scenario is more likely to occur in larger projects, but bugs like this are probably very hard to catch (this is a guess, I never had such situation).
As for possible collisions, if you follow some naming conventions such as option name always starting with a capital letter, plus put most of your code in packages, and do not start your variable or function names (for functions in the interactive session), with a capital letter, then you will greatly reduce the chance of such collisions. Additionally, you should Protect option names, when you define them, or at the end of the package. Then, the collisions will be detected as cases of shadowing. Avoiding shadowing, OTOH, is a general necessity, so the case of options is no more special in this respect than for function names etc.

Code duplication refactoring tool for VB

I need a very specific tool for VB (or multi-language). I thought I would ask if one already exists, before I start making one myself (probably, in python).
What I need:
The tool must crawl, recursivelly or not, a path, searching for a list of extension, such as .bas, .frm, .xxx
Then, It has to parse that files, searching for functions, routines, etc.
And finally, it must output what it found.
I based this on the idea of, "reducing code redundance", in an scenario where, bad programmers make a lot of functions that do the same thing, sometimes with the same name, sometimes not. There are 4 cases:
Case 1: Same name, Same content.
Case 2: Same name, Diff content.
Case 3: Diff name, Same content.
Case 4: Diff name, Diff Content.
So, the output, should be something like this
===========================================================================
RESULT
===========================================================================
Errors:
---------------------------------------------------------------------------
==Name, ==Content --> 3: (Func(), Foo(), Bar()) In files (f,f2,f3)
!=Name, ==Content --> 2: (Func() + Func1(), Bar() + Bar1()) In Files (f4)
---------------------------------------------------------------------------
Warnings:
==Name, !=Content --> 1 (Foobar()) In Files (f19)
---------------------------------------------------------------------------
This is to give you an idea of what I need.
So, the question is: is there any tool that acomplish something similar to this???
P.S: Yes, we should write good code, in first instance, but, you know, stuff happens.
What you want is a "clone detector". These tools find copy-and-pasted code across a large set of designated files. Clones are not just of functions; they can be code blocks, data declarations, etc.
There are a variety of detectors out there, and you should know how they work before you attempt to build one of your own.
Some simply match lines for exact equivalence. While these demonstrate the basic idea, their detection is not good because they don't take into account the fact that cloned code often has variations; what people really do is clone-and-edit when making copies.
Some match sequences of langauge tokens, e.g., identifiers, keywords, literals, punctuation. These at least are relatively tolerant of whitespace changes. And they can find clones in which single tokens have been substituted for single tokens. However, because they don't understand language structure (blocks, statements, function bodies) they often match sequences that cross such structure boundaries (e.g., "} {" is often considered a clone by these tools), they produce rather high false-positive indications of (non)clones. Some of these attempt to limit the matches to key program structures, such as complete functions, as you have kind of suggested.
More sophisticated detectors match program structures.
Our CloneDR (I'm the original author) is a detector that
uses compiler-quality parsing to abstract syntax trees, which extracts the precise structure of the code. It does this for many languages (including VB6 and VBScript), locating clones as arbitrary functions, blocks, statements or declarations, with parameters shows how the clones vary. CloneDR can find clones in spite of formatting changes, changes in comment locations or content, and even variations where complex constructs (multiple statements or expressions) have been used as alternatives to simple ones (e.g., a single statment or a literal). While it tends to have a high detection rate(it usually finds 10-20% removable redundancy!), its false-positive rate tends to be considerably lower than the token based detectors. You can see sample reports for
a variety of different langauges at the link above.
See Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach which explicitly discusses different approaches and benefits, and compares a large number of detectors including CloneDR.
EDIT October 2010: ... When I first wrote this response, I assumed the OP was interested in VB.net, which CloneDR didn't do. We've since added VB.net, VB6 and VBScript capability to CloneDR. (Parsing VB.net in its modern form is a lot messier than one might imagine for "simple"(!) langauge like Visual Basic).

Cross version line matching

I'm considering how to do automatic bug tracking and as part of that I'm wondering what is available to match source code line numbers (or more accurate numbers mapped from instruction pointers via something like addr2line) in one version of a program to the same line in another. (Assume everything is in some kind of source control and is available to my code)
The simplest approach would be to use a diff tool/lib on the files and do some math on the line number spans, however this has some limitations:
It doesn't handle cross file motion.
It might not play well with lines that get changed
It doesn't look at the information available in the intermediate versions.
It provides no way to manually patch up lines when the diff tool gets things wrong.
It's kinda clunky
Before I start diving into developing something better:
What already exists to do this?
What features do similar system have that I've not thought of?
Why do you need to do this? If you use decent source version control, you should have access to old versions of the code, you can simply provide a link to that so people can see the bug in its original place. In fact the main problem I see with this system is that the bug may have already been fixed, but your automatic line tracking code will point to a line and say there's a bug there. Seems this system would be a pain to build, and not provide a whole lot of help in practice.
My suggestion is: instead of trying to track line numbers, which as you observed can quickly get out of sync as software changes, you should decorate each assertion (or other line of interest) with a unique identifier.
Assuming you're using C, in the case of assertions, this could be as simple as changing something like assert(x == 42); to assert(("check_x", x == 42)); -- this is functionally identical, due to the semantics of the comma operator in C and the fact that a string literal will always evaluate to true.
Of course this means that you need to identify a priori those items that you wish to track. But given that there's no generally reliable way to match up source line numbers across versions (by which I mean that for any mechanism you could propose, I believe I could propose a situation in which that mechanism does the wrong thing) I would argue that this is the best you can do.
Another idea: If you're using C++, you can make use of RAII to track dynamic scopes very elegantly. Basically, you have a Track class whose constructor takes a string describing the scope and adds this to a global stack of currently active scopes. The Track destructor pops the top element off the stack. The final ingredient is a static function Track::getState(), which simply returns a list of all currently active scopes -- this can be called from an exception handler or other error-handling mechanism.

Standards Document

I am writing a coding standards document for a team of about 15 developers with a project load of between 10 and 15 projects a year. Amongst other sections (which I may post here as I get to them) I am writing a section on code formatting. So to start with, I think it is wise that, for whatever reason, we establish some basic, consistent code formatting/naming standards.
I've looked at roughly 10 projects written over the last 3 years from this team and I'm, obviously, finding a pretty wide range of styles. Contractors come in and out and at times, and sometimes even double the team size.
I am looking for a few suggestions for code formatting and naming standards that have really paid off ... but that can also really be justified. I think consistency and shared-patterns go a long way to making the code more maintainable ... but, are there other things I ought to consider when defining said standards?
How do you lineup parenthesis? Do you follow the same parenthesis guidelines when dealing with classes, methods, try catch blocks, switch statements, if else blocks, etc.
Do you line up fields on a column? Do you notate/prefix private variables with an underscore? Do you follow any naming conventions to make it easier to find particulars in a file? How do you order the members of your class?
What about suggestions for namespaces, packaging or source code folder/organization standards? I tend to start with something like:
<com|org|...>.<company>.<app>.<layer>.<function>.ClassName
I'm curious to see if there are other, more accepted, practices than what I am accustomed to -- before I venture off dictating these standards. Links to standards already published online would be great too -- even though I've done a bit of that already.
First find a automated code-formatter that works with your language. Reason: Whatever the document says, people will inevitably break the rules. It's much easier to run code through a formatter than to nit-pick in a code review.
If you're using a language with an existing standard (e.g. Java, C#), it's easiest to use it, or at least start with it as a first draft. Sun put a lot of thought into their formatting rules; you might as well take advantage of it.
In any case, remember that much research has shown that varying things like brace position and whitespace use has no measurable effect on productivity or understandability or prevalence of bugs. Just having any standard is the key.
Coming from the automotive industry, here's a few style standards used for concrete reasons:
Always used braces in control structures, and place them on separate lines. This eliminates problems with people adding code and including it or not including it mistakenly inside a control structure.
if(...)
{
}
All switches/selects have a default case. The default case logs an error if it's not a valid path.
For the same reason as above, any if...elseif... control structures MUST end with a default else that also logs an error if it's not a valid path. A single if statement does not require this.
In the occasional case where a loop or control structure is intentionally empty, a semicolon is always placed within to indicate that this is intentional.
while(stillwaiting())
{
;
}
Naming standards have very different styles for typedefs, defined constants, module global variables, etc. Variable names include type. You can look at the name and have a good idea of what module it pertains to, its scope, and type. This makes it easy to detect errors related to types, etc.
There are others, but these are the top off my head.
-Adam
I'm going to second Jason's suggestion.
I just completed a standards document for a team of 10-12 that work mostly in perl. The document says to use "perltidy-like indentation for complex data structures." We also provided everyone with example perltidy settings that would clean up their code to meet this standard. It was very clear and very much industry-standard for the language so we had great buyoff on it by the team.
When setting out to write this document, I asked around for some examples of great code in our repository and googled a bit to find other standards documents that smarter architects than I to construct a template. It was tough being concise and pragmatic without crossing into micro-manager territory but very much worth it; having any standard is indeed key.
Hope it works out!
It obviously varies depending on languages and technologies. By the look of your example name space I am going to guess java, in which case http://java.sun.com/docs/codeconv/ is a really good place to start. You might also want to look at something like maven's standard directory structure which will make all your projects look similar.

Resources