In my understanding, when Matz invented Ruby, he pretty much lifted a lot of Perl language constructs and duplicated them. Does this extend to regular expressions as well, or are there any syntactical differences that I should be aware of?
There's an extensive comparison of regex support across many languages at https://www.regular-expressions.info/refbasic.html and its sibling pages. Pick the two languages you want to compare with respect to regex capabilities, and the table will show you the differences.
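For example (a small illustration, not an exhaustive list): in Ruby, ^ and $ always anchor at line boundaries, and the /m modifier makes . match newlines, which is what Perl spells /s, so Perl habits around /m are a common tripwire.

str = "foo\nbar"
str =~ /^bar$/     # => 4: ^ and $ match at line boundaries by default in Ruby
str =~ /foo.bar/   # => nil: . does not cross the newline
str =~ /foo.bar/m  # => 0: Ruby's /m means "dot matches newline" (Perl's /s)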
I need to implement an algorithm for multiple extended string matching in text.
Extended means the patterns can contain wildcards (a star stands for any run of characters), for example:
abc*def //matches abcdef, abcpppppdef etc.
Multiple means that the search runs over several string patterns simultaneously (not a separate search for each pattern), for example:
abc*def
abc
whatever
some*string
QUESTION:
What is a fast algorithm that can do multiple extended string matching?
Preferably one optimized for SIMD instructions and multicore implementation. An open source implementation (C/C++/Python) would be great as well.
Thank you
I think it might make sense to start by reading this section of the Wikipedia article on regular expressions: http://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times. You can then perform a literature review of algorithms implementing regular expression pattern matching.
In terms of practical implementation, there is a large variety of regular expression (regex) engines in the form of libraries, focused on one or more programming languages. Most likely, the best and most popular option is the C/C++ PCRE library, with its newest version PCRE2, released in 2015. Another C++ regex library, which is quite popular at Google, is RE2. I recommend reading this paper, along with the two others linked within the article, for details on algorithms, implementation and benchmarks. Google recently released RE2/J - a linear-time version of RE2 for Java: see this blog post for details. Finally, I ran across an interesting pure C regex library, TRE, which offers way too many cool features to list here; you can read about them all on this page.
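As a concrete illustration of the single-pass idea, each wildcard pattern can be translated into a regex and the results combined into one alternation, so a single engine scans the text once for all patterns. A minimal sketch in Ruby (purely illustrative, not SIMD-optimized; the same approach works with PCRE2 or RE2 from C/C++):

patterns = ["abc*def", "abc", "whatever", "some*string"]
regexes = patterns.map do |p|
  # escape the literal pieces, turn each '*' wildcard into '.*'
  Regexp.new(p.split("*", -1).map { |piece| Regexp.escape(piece) }.join(".*"))
end
combined = Regexp.union(regexes)  # one big alternation
"xxabcpppppdef and some long string".scan(combined)
# => ["abcpppppdef", "some long string"]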
P.S. If the above is not enough for you, feel free to visit this Wikipedia page for details of many more regex engines/libraries and their comparison across several criteria. Hope my answer helps.
I saw that inject and reduce are documented together here. Are they the same thing? Why does Ruby have so many aliases (such as map/collect for arrays)? Thanks a lot.
Yes, and it's also called fold in many other programming languages and in Mathematics. Ruby aliases a lot in order to be intuitive to programmers with different backgrounds. If you want to use #length on an Array, you can. If you want to use #size, that's fine too!
More recent versions of the documentation of Enumerable#reduce specify it explicitly:
The inject and reduce methods are aliases. There is no performance benefit to either.
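A quick check in irb (any reasonably recent Ruby) shows the two names behaving identically:

(1..4).inject(:+)                   # => 10
(1..4).reduce(:+)                   # => 10
(1..4).inject { |sum, x| sum + x }  # => 10, the block form is the same for both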
Are they the same thing?
Yes, aliases run the exact same code in the end.
Why does Ruby have so many aliases (such as map/collect for arrays)?
It boils down to the language's design philosophy; different languages take different approaches here.
Ruby does it in favor of developer productivity: by having aliases, programmers coming from different programming-language and human-language backgrounds can write code more intuitively.
However, aliases can also help your code's clarity, because the same operation can carry different semantics in context: a method like midnight() might also be expressed as start_of_day or end_of_day, and one of those names can be clearer depending on the context.
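If you want that kind of readability in your own code, defining an alias is a one-liner (the class and method names below are made up for illustration):

class Day
  def midnight
    "00:00"  # hypothetical: the first instant of the day
  end
  alias_method :start_of_day, :midnight
end

Day.new.start_of_day  # => "00:00", runs the exact same method body as midnight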
By the way, some programmers use inject and reduce to differentiate between different semantic situations too.
I'm writing some Extended Backus–Naur Form grammars for document parsing. There are lots of excellent guides for the syntax of these definitions, but very little online about how to design and structure them.
Can anyone suggest good articles (or general tips) on how you approach writing these? There does seem to be an element of style, even if the final parse trees can be equivalent.
e.g. things like:
Deciding whether to explicitly tag newlines, or just treat them as whitespace
Naming schemes for your nonterminals
Handling optional whitespace in long definitions
When to add explicit rules for bad syntax vs. just letting it fail to match
Thanks,
You should work in the direction that you are most comfortable with - either bottom-up, top-down, or "sandwich" (do a little of both, meet somewhere in the middle).
Any "group" that can be derived and has a meaning of its own should get its own non-terminal. For example, I would use one non-terminal for all newline-related whitespace, one for all other whitespace, and one for all whitespace (which is basically the union of the former two).
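For instance, something along these lines (the names and escape notation are only suggestions; the exact lexical syntax depends on your EBNF dialect):

NewlineWS  = "\n" | "\r\n" ;
InlineWS   = " " | "\t" ;
Whitespace = NewlineWS | InlineWS ;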
Naming conventions in grammars in general are that non-terminals are, or start with, a capital letter, and terminals start with non-capitals (but this of course depends on the language you're designing).
Regarding bad-syntax checks, I'm not familiar with the concept. What I know of EBNFs is that you just write everything your language accepts, and only that.
Generally, just look around at some EBNFs of different languages from different websites, get a feeling of how they look, and then do what feels right to you.
I imagine Scheme (and perhaps Lisp) could be made more "user friendly" by using a different syntax. For example, instead of nested S-expressions with ugly parentheses, one could devise some kind of syntax closer to some of the more widely used languages (e.g. Java-like without needing to define classes).
It's not necessarily a bad thing if it's more verbose. For example, the syntax may require line separators and commas in the places where many people will expect them, and expect explicit return statements. Also, it doesn't seem that difficult to allow some operators to be used infix style (just obey the generally accepted operator precedence rules).
And if it doesn't make things too messy, the syntax could even be backwards-compatible, so that in any place where an expression is expected, a normal S-expression between parentheses can be used.
What are your opinions and ideas about this? And does anything like this exist? (I expect it does, but "Scheme" is a worthless google term, I can't find anything!)
Originally, Lisp was planned to use a syntax called M-Expressions, with S-Expressions being only a transitional solution to make the compiler easier to build. When M-Expressions were ready to be introduced, the programmers who had already taken up Lisp just stayed with what they had become accustomed to, and M-Expressions never caught on.
There is an infix notation in Guile, but it's rarely used. A good Lisp programmer doesn't even see the parens anymore, and prefix notation does have its merits...
I think "sweet expressions" might be one of the more thoughtful approaches to getting rid of the parentheses in Lisp. It apparently even supports macros.
http://www.dwheeler.com/readable/sweet-expressions.html
However, I think most people eventually get over the parentheses or use another language.
Take a look at "sweet-expressions", which provides a set of additional abbreviations for traditional s-expressions. They add syntactically-relevant indentation, a way to do infix, and traditional function calls like f(x). Unlike nearly all past efforts to make Lisps readable, sweet-expressions are backwards-compatible (you can freely mix well-formatted s-expressions and sweet-expressions), generic, and homoiconic.
Sweet-expressions were developed on http://readable.sourceforge.net and there is a sample implementation.
For Scheme there is an SRFI for sweet-expressions: http://srfi.schemers.org/srfi-110/
Try SRFI 49 for size. :-P
(Seriously, though, as Rafe commented, "I don't think anybody wants this".)
Some people consider Python to be a kind of Scheme with infix notation for operators, algebraic notation for functions and which uses a more "java-like" syntax for representing the language. I don't agree with that assessment, but I can see where the idea comes from.
The big problem with changing the notation for Scheme is that macros become very hard to write (to see how hard, take a look at the Nimrod language or Boo). Instead of working directly with the code as lists, you have to parse the input language first. This usually involves constructing an AST (abstract syntax tree) for the language from the input. When working directly with Scheme, this is unnecessary.
However, you might check out the SIX expression syntax in Gambit Scheme. There's a nice set of slides here which contains a discussion of this:
http://www.iro.umontreal.ca/~gambit/Gambit-inside-out.pdf
But don't tell anyone about it! (The inside joke is that someone suggests writing a Lisp without parentheses and with infix notation about once a day, and someone announces an implementation about once a month.)
There are some languages that do exactly that. For instance: Dylan.
In my day job, I and others on my team write a lot of hardware models in Verilog-AMS, a language supported primarily by commercial vendors and a few open-source simulator projects.
One thing that would make supporting each other's code easier would be a linter that checks our code for common problems and helps enforce a shared code-formatting style.
I of course want to be able to add my own rules and, after I prove their utility to myself, promote them to the rest of the team.
I don't mind doing the work that has to be done, but of course also want to leverage the work of other existing projects.
Does having the allowed language syntax in a yacc or bison format give me a leg up?
Or should I just suck each language statement into a Perl string and use pattern matching to find the things I don't like?
(Most syntax and compilation errors are easily caught by the commercial tools, but we have some extensions of our own.)
lex/flex and yacc/bison provide easy-to-use, well-understood lexer- and parser-generators, and I'd really recommend doing something like that as opposed to doing it procedurally in e.g. Perl. Regular expressions are powerful stuff for ripping apart strings with relatively-, but not totally-fixed structure. With any real programming language, the size of your state machine gets to be simply unmanageable with anything short of a Real Lexer/Parser (tm). Imagine dealing with all possible interleavings of keywords, identifiers, operators, extraneous parentheses, extraneous semicolons, and comments that are allowed in something like Verilog AMS, with regular expressions and procedural code alone.
There's no denying that there's a substantial learning curve there, but writing a grammar that you can use for flex and bison, and doing something useful on the syntax tree that comes out of bison, will be a much better use of your time than writing a ton of special-case string-processing code that's more naturally dealt with using a syntax-tree in the first place. Also, what you learn writing it this way will truly broaden your skillset in ways that writing a bunch of hacky Perl code just won't, so if you have the means, I highly recommend it ;-)
Also, if you're lazy, check out the Eclipse plugins that do syntax highlighting and basic refactoring for Verilog and VHDL. They're in an incredibly primitive state, last I checked, but they may have some of the code you're looking for, or at least a baseline piece of code to look at to better inform your approach in rolling your own.
I've written a couple of Verilog parsers and I would suggest PCCTS/ANTLR if your favorite programming language is C/C++/Java. There is a PCCTS/ANTLR Verilog grammar that you can start with. My favorite parser generator is Zebu, which is based on Common Lisp.
Of course the big job is to specify all the linting rules. It makes sense to make some kind of language to specify the linting rules as well.
Don't underestimate the amount of work that goes into a linter. Parsing is the easy part because you have tools (bison, flex, ANTLR/PCCTS) to automate much of it.
But once you have a parse, then what? You must build a semantic tree for the design. Depending on how complicated your inputs are, you must elaborate the Verilog-AMS design (i.e. resolving parameters, unrolling generates, etc., if you use those features). Only then can you try to implement rules.
I'd seriously consider other possible solutions before writing a linter, unless the number of users and potential time savings thereby justify the development time.
In trying to find my answer, I found this on ANTLR - might be of use
If you use Java at all (and thus IDEA), the IDE's extensions for custom languages might be of use
yacc/bison definitely gives you a leg up, since good linting requires parsing the program. Regexes (true regexes, at least) might cover trivial cases, but it is easy to write code that the regexes don't flag yet is still bad style.
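As a toy illustration of that limit (Ruby only for brevity, and the "rule" here is made up): a line-by-line regex check catches the one-line case but goes blind as soon as the statement wraps.

rule = /^\s*\w+\s*=\s*.+;/   # made-up rule: flag simple assignment statements
simple  = "v = i * r;"
wrapped = "v =\n  i * r;"
simple.lines.any?  { |line| line =~ rule }   # => true  (caught)
wrapped.lines.any? { |line| line =~ rule }   # => false (missed: the statement spans two lines)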
ANTLR looks to be an alternative path to the more common (OK, I had heard about them before) yacc/bison approach, which it turns out also commonly uses lex/flex as a front end.
A quick read of the flex man page kind of makes me think it could be the framework for that regex type of idea.
OK, I'll let this stew a little longer, then see how quickly I can build a prototype parser in one or the other.
and a little bit longer