Syntax coloring in Monaco Editor (Monarch): '@popall' when a new line is encountered

I am currently writing a syntax highlighter for Monaco Editor, using Monarch.
I am using states to handle tokens differently depending on where they are in the line.
What I would like to do is @popall the states when I reach the end of the line, as all lines are independent.
Right now the only way I have found is to add conditions at the end of all my rules, something like this:
[/\}/, {cases: {'@eos':     {token: 'keyword', next: '@popall'},
                '@default': {token: 'keyword', next: '@pop'}}}],
which is obviously redundant, as my 50+ rules all need this case.
What is the clean way of doing this?
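For reference, one way to avoid repeating that case block in every rule would be to build it with a small helper function. This is only a sketch in plain JavaScript (Monarch definitions are ordinary JS objects), not a built-in Monarch shortcut, and the token names are just placeholders:

// Hypothetical helper: returns the repeated action so each rule states it once.
function popOrPopAll(token) {
  return {
    cases: {
      '@eos':     { token: token, next: '@popall' }, // last token on the line: unwind all states
      '@default': { token: token, next: '@pop' }     // otherwise pop just the current state
    }
  };
}

// Usage inside a state, e.g.:
// [/\}/, popOrPopAll('keyword')],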

Related

Xtext - Multiline String like in YAML

I'm trying to model a YAML-like DSL in Xtext. In this DSL, I need a multiline string, as in YAML:
description: |
  Line 1
  line 2
  ...
My first try was this:
terminal BEGIN:
'synthetic:BEGIN'; // increase indentation
terminal END:
'synthetic:END'; // decrease indentation
terminal MULTI_LINE_STRING:
"|"
BEGIN ANY_OTHER END;
and my second try was
terminal MULTI_LINE_STRING:
"|"
BEGIN
((!('\n'))+ '\n')+
END;
but neither of them succeeded. Is there any way to do this in Xtext?
UPDATE 1:
I've tried this alternative as well.
terminal MULTI_LINE_STRING:
"|"
BEGIN ->END
When I triggered the "Generate Xtext Artifacts" process, I got this error:
3492 [main] INFO nerator.ecore.EMFGeneratorFragment2 - Generating EMF model code
3523 [main] INFO clipse.emf.mwe.utils.GenModelHelper - Registered GenModel 'http://...' from 'platform:/resource/.../model/generated/....genmodel'
error(201): ../.../src-gen/.../parser/antlr/lexer/Internal..Lexer.g:236:71: The following alternatives can never be matched: 1
error(3): cannot find tokens file ../.../src-gen/.../parser/antlr/internal/Internal...Lexer.tokens
error(201): ../....idea/src-gen/.../idea/parser/antlr/internal/PsiInternal....g:4521:71: The following alternatives can never be matched: 1
This slide deck shows how we implemented whitespace block scoping in an Xtext DSL.
We used synthetic tokens called BEGIN corresponding to an indent, and END corresponding to an outdent.
(Note: the language was subsequently renamed to RAPID-ML, included as a feature of RepreZen API Studio.)
I think your main problem is that you have not defined when your multiline token ends. Before you come to a solution, you have to make clear in your mind how an algorithm should determine the end of the token. No tool can take this mental burden from you.
Issue: there is no end-marking character. Either you have to define such a character (unlike YAML) or define the end of the token in another way, for example through some sort of semantic whitespace (I think YAML does it like that).
The first approach would make the thing very easy: just read content until you find the closing character. The second approach would probably be manageable using a custom lexer. Basically, you replace the generated lexer with your own implementation that is able to count blanks or similar.
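As an illustration of the first approach only (a sketch, assuming you are willing to introduce an explicit end marker; the sequence '|end' below is made up), Xtext's "until" operator can consume everything up to a closing sequence, the same way the standard ML_COMMENT terminal consumes everything up to '*/':

// Hypothetical end marker '|end'; everything between '|' and '|end' becomes the token.
terminal MULTI_LINE_STRING:
    '|' -> '|end';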
Here are some starting points for how the custom-lexer route (the second approach) could be done (different approaches are conceivable):
Writing a custom Xtext/ANTLR lexer without a grammar file
http://consoliii.blogspot.de/2013/04/xtext-is-incredibly-powerful-framework.html

Undefined procedure error in Prolog when using R-2-L words

I want to make an Arabic morphological analyzer using Prolog.
I have implemented the following code.
check(ي,1,male).
check(ت,1,female).
check(ا,1,me).
dict(لعب,3).
ending('',0,single).
ending(ون,2,plur).
parse([]).
parse(Word,Gender,Verb,Plurality):-
    sub_atom(Word,0,LenHead,_,FirstCut),
    check(FirstCut,LenHead,Gender),
    sub_atom(Word,LenHead,_,LenAfter,Verb),
    dict(Verb,LenOfVerb),
    Location is LenHead+LenOfVerb,
    sub_atom(Word,Location,LenAfter,_,EndOfWord),
    ending(EndOfWord,_,Plurality).
This is called using:
parse(يلعب,A,S,D).
Expectation:
A = male
S = لعب
D = single
Explanation of code:
It should parse the word يلعب. Note that in Arabic the ي (the first letter, on the right) indicates that it is a masculine word, and لعب is a verb.
Error:
When running the code, I get the following error:
ERROR: parse/4: Undefined procedure: dict/2
Note that when mimicking the Arabic word using English letters, the code behaves as expected and doesn't produce this error.
How can I resolve this error, or make Prolog understand R-to-L words?
Edit:
In the attached image, note that in the red box it succeeded in matching the ي to male. In the blue box, when it failed, it should have backtracked and started concatenating to try to match a new word, but instead it produces the error shown.
You have to be careful when you are using SWI-Prolog on the Mac. There is a slight problem with copy-paste: if you use [user] and then paste multiple lines, it doesn't read all of the lines.
This happens all the time and isn't related to the Arabic script or Unicode, or some such. I have filed a bug report with SWI-Prolog. When you use [user] and enter the lines one by one, you get the right result.
In the screenshot I took, you can see that I did a one-by-one paste, since there are multiple '|:' prompts. Other Prolog systems don't necessarily have this problem; in Jekejeke Prolog, for example, I get the expected result.
The best workaround for SWI-Prolog is probably to store the facts in a file and consult them from there. In Jekejeke Prolog I still have to investigate why the space after the comma shows up on the wrong side.
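A minimal sketch of that workaround (the file name morph.pl is made up; the important part is saving the clauses as UTF-8 and loading the file instead of pasting into [user]):

% morph.pl (hypothetical name), saved as UTF-8, containing the clauses from the question:
%   check(ي,1,male).  check(ت,1,female).  check(ا,1,me).
%   dict(لعب,3).  ending('',0,single).  ending(ون,2,plur).
%   ... plus the parse/4 rule ...
% Then load and query it from the toplevel:
?- consult('morph.pl').
?- parse(يلعب, A, S, D).
% expected: A = male, S = لعب, D = single.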

Find HTML Tags in Properties

My current issue is finding HTML tags inside property values. I thought it would be easy to search with a query like
/jcr:root/content/xgermany//*[jcr:contains(., '<strong>')] order by @jcr:score
It looks like there is a problem with the characters < and >, because this query finds everything that has strong in its property. It finds <strong>Some Text</strong>, but also This is a strong man.
The Query Builder API didn't help me either.
Is there a way to solve this with an XPath or SQL query, or do I have to iterate through the whole content?
I don't fully understand why it finds This is a strong man as a result for '<strong>', but it sounds like the unexpected behavior comes from the "simple search-engine syntax" for the second argument to jcr:contains(). Apparently the < > are just being ignored as "meaningless" punctuation.
You could try quoting the search term:
/jcr:root/content/xgermany//*[jcr:contains(., '"<strong>"')]
though you may have to tweak that if your whole XPath expression is enclosed in double quotes.
Of course this will not be very robust even if it works, since you're trying to find HTML elements by searching for fixed strings, instead of actually parsing the HTML.
If you have a specific jcr:primaryType and know the targeted properties, you can do something like this:
select * from nt:unstructured where text like '%<strong>%'
I tested it, but you need to know the properties you are interested in.
This is JCR-SQL syntax.
Start using predicates like a champ; this way all of this will make sense to you!
HTML Encode: &lt;strong&gt;
HTML Decimal: &#60;strong&#62;
Query builder is your friend:
Predicates: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%<strong>%
Have a go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%3Cstrong%3E%25
Predicates for the HTML-encoded form: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%&lt;strong&gt;%
Have a go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%26lt%3Bstrong%26gt%3B%25
XPath:
/jcr:root/content/geometrixx//element(*, nt:unstructured)
[
jcr:like(@text, '%<strong>%')
]
SQL2 (already covered... NASTY YUK..)
SELECT * FROM [nt:unstructured] AS s WHERE ISDESCENDANTNODE([/content/geometrixx]) and text like '%<strong>%'
Although I'm sure it's entirely possible with a string of predicates, it's possibly heading down the wrong route. Ideally it would be better to parse the HTML when it is stored or published.
The required information would be stored in simple properties on the node in question. The query would then be a lot simpler, with just a property = value comparison rather than lots of overly complex query syntax.
It will probably be faster too.
So if you read in your HTML with something like HTMLClient and then parse it with an OSGi service, that service can accurately save these properties for you. Every time the HTML is changed, the process would update these properties as necessary. Just some thoughts if your SQL is getting too much.
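For example (a sketch only; hasStrongTag is a made-up property name that such a service would set when the content is stored or published), the query then collapses to a plain property check:

/jcr:root/content/xgermany//*[@hasStrongTag = 'true']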

Getting First and Follow metadata from an ANTLR4 parser

Is it possible to extract the first and follow sets from a rule using ANTLR4? I played around with this a little bit in ANTLR3 and did not find a satisfactory solution, but if anyone has info for either version, it would be appreciated.
I would like to parse user input up to the user's cursor location and then provide a list of possible choices for auto-completion. At the moment, I am not interested in auto-completing tokens which are partially entered. I want to display all possible following tokens at some point mid-parse.
For example:
sentence:
    subjects verb (adverb)? '.' ;
subjects:
    firstSubject (otherSubjects)* ;
firstSubject:
    'The' (adjective)? noun ;
otherSubjects:
    'and the' (adjective)? noun;
adjective:
    'small' | 'orange' ;
noun:
    CAT | DOG ;
verb:
    'slept' | 'ate' | 'walked' ;
adverb:
    'quietly' | 'noisily' ;
CAT : 'cat';
DOG : 'dog';
Given the grammar above...
If the user had not typed anything yet the auto-complete list would be ['The'] (Note that I would have to retrieve the FIRST and not the FOLLOW of rule sentence, since the follow of the base rule is always EOF).
If the input was "The", the auto-complete list would be ['small', 'orange', 'cat', 'dog'].
If the input was "The cat slept, the auto-complete list would be ['quietly', 'noisily', '.'].
So ANTLR3 provides a way to get the set of follows doing this:
BitSet followSet = state.following[state._fsp];
This works well. I can embed some logic into my parser so that when the parser calls the rule at which the user is positioned, it retrieves the follows of that rule and then provides them to the user. However, this does not work as well for nested rules (for instance, the base rule), because the follow set ignores any sub-rule follows, as it should.
I think I need to provide the FIRST set if the user has completed a rule (this could be hard to determine) as well as the FOLLOW set, to cover all valid options. I also think I will need to structure my grammar such that two tokens are never subsequent at the rule level.
I would have to break the above "firstSubject" rule into some sub-rules...
from
firstSubject:
'The'(adjective)? CAT | DOG;
to
firstSubject:
the (adjective)? CAT | DOG;
the:
'the';
I have yet to find any information on retrieving the FIRST set from a rule.
ANTLR4 appears to have drastically changed the way it works with follows at the level of the generated parser, so at this point I'm not really sure if I should continue with ANTLR3 or make the jump to ANTLR4.
Any suggestions would be greatly appreciated.
ANTLRWorks 2 (AW2) performs a similar operation, which I'll describe here. If you reference the source code for AW2, keep in mind that it is only released under an LGPL license.
1. Create a special token which represents the location of interest for code completion.
   In some ways, this token behaves like the EOF. In particular, the ParserATNSimulator never consumes this token; a decision is always made at or before it is reached.
   In other ways, this token is very unique. In particular, if the token is located at an identifier or keyword, it is treated as though the token type was "fuzzy", and allowed to match any identifier or keyword for the language. For ANTLR 4 grammars, if the caret token is located at a location where the user has typed g, the parser will allow that token to match a rule name or the keyword grammar.
2. Create a specialized ATN interpreter that can return all possible parse trees which lead to the caret token, without looking past the caret for any decision, and without constraining the exact token type of the caret token.
3. For each possible parse tree, evaluate your code completion in the context of whatever the caret token matched in a parser rule.
4. The union of all the results found in step 3 is a superset of the complete set of valid code completion results, and can be presented in the IDE.
The following describes AW2's implementation of the above steps.
1. In AW2, this is the CaretToken, and it always has the token type CARET_TOKEN_TYPE.
2. In AW2, this specialized operation is represented by the ForestParser<TParser> interface, with most of the reusable implementation in AbstractForestParser<TParser> and specialized for parsing ANTLR 4 grammars for code completion in GrammarForestParser.
3. In AW2, this analysis is performed primarily by GrammarCompletionQuery.TaskImpl.runImpl(BaseDocument).
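Separately from AW2's forest-parsing approach, and only as a rough sketch of what the stock ANTLR 4 runtime exposes: a Parser can report the token types it would accept at its current ATN state via getExpectedTokens(), which behaves like a context-sensitive FOLLOW set at that point in the parse. MyLexer/MyParser below are hypothetical classes generated from the example grammar; this is not a full code-completion solution, just a way to peek at the expected-token information when the parse stops at the caret.

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.IntervalSet;

public class ExpectedTokensDemo {
    public static void main(String[] args) {
        // Parse a truncated input; the parser stops at EOF, where the caret would be.
        MyLexer lexer = new MyLexer(CharStreams.fromString("The cat"));
        MyParser parser = new MyParser(new CommonTokenStream(lexer));
        parser.removeErrorListeners();
        parser.addErrorListener(new BaseErrorListener() {
            @Override
            public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                                    int line, int charPositionInLine, String msg,
                                    RecognitionException e) {
                // Token types the parser would have accepted at the failure point.
                IntervalSet expected = ((Parser) recognizer).getExpectedTokens();
                System.out.println(expected.toString(recognizer.getVocabulary()));
            }
        });
        parser.sentence(); // start rule from the grammar in the question
    }
}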

Ruby, regex, sentences

I'm currently building a code generator, which aims to generate boilerplate for me once I write the templates and/or translations, in whatever language I have to work with.
I have a problem with a regex in Ruby. The regex aims to select whatever is between {{{ and }}}, so I can generate functions according to my needs.
My regex is currently :
/\{\{\{(([a-zA-Z]|\s)+)\}\}\}/m
My test data set is:
{{{Demande aaa}}} => {{{tagadatsouintsouin tutu}}}
The results are:
[["Demande aaa", "a"], ["tagadatsouintsouin tutu", "u"]]
Each time the regex picks the last character twice. That's not exactly what I want, I need something more like this:
/\{\{\{((\w|\W)+)\}\}\}/m
But this has a flaw too, the results are:
[["Demande aaa}}} => {{{tagadatsouintsouin tutu", "u"]]
Whereas, I wish to get:
[["Demande aaa"],["tagadatsouintsouin tutu"]]
How do I correct these regexes? I could use two sets of delimiters, but it won't teach me anything.
Edit:
All your regexes work against my data sample, so you all have a point.
Regexes may be overkill, and probably are overkill for my purpose. So I have two questions.
First, do the regexes keep the exact same indentation? They should be able to handle whole functions.
Second, is there something better suited to that task?
Here is a more detailed explanation of the purpose of this tool. I'm tired of writing boilerplate code in PHP (Symfony), so I wish to generate it from templates.
My intent is to build some views, some controllers, and even parts of the model this way.
Practical example: in my model, I wish to generate some functions according to the type of an object's attribute. For example, I have functions for displaying money correctly. So I need to build the correct function, according to my attribute, and then put it in my output file.
So there are some translations which themselves need translations.
A fictional example:
{{{euro}}} => {{{ function getMyAttributeEuro()
{
    return formating($this->get[[MyAttribute]]);
} }}}
In order to store my translations, should I use regexes like these?
I wish to build something a bit clever, so it can generate most of the basic code without bugs, and I can work on the interesting code.
You have one set of capturing parentheses too many.
/\{\{\{([a-zA-Z\s]+)\}\}\}/
Also, you don't need the /m modifier because there is no dot (.) in your regex whose behaviour would be affected by it.
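As a quick check against the sample data (just a sketch run in irb; with a single capture group, String#scan returns one sub-array per match, which is exactly the desired output):

data = '{{{Demande aaa}}} => {{{tagadatsouintsouin tutu}}}'
p data.scan(/\{\{\{([a-zA-Z\s]+)\}\}\}/)
# => [["Demande aaa"], ["tagadatsouintsouin tutu"]]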
I'm partial to:
data = '{{{Demande aaa}}} => {{{tagadatsouintsouin tutu}}}'
data.scan(/\{{3}(.+?)}{3}/).flatten.map{ |r| r.squeeze(' ') }
=> ["Demande aaa", "tagadatsouintsouin tutu"]
or:
data.scan(/\{{3}(.+?)}{3}/).flatten.map{ |r| [ r.squeeze(' ') ] }
=> [["Demande aaa"], ["tagadatsouintsouin tutu"]]
or:
data.scan(/\{{3}(.+?)}{3}/).map{ |r| [ r[0].squeeze(' ') ] }
=> [["Demande aaa"], ["tagadatsouintsouin tutu"]]
if you need the sub-arrays.
I'm not big on trying to do everything possible inside the regex. I prefer to keep it short and sweet, then polish the output once I've found what I was looking for. It's a maintenance issue, because regexes make my head hurt, and I stopped thinking of them as a macho thing years ago. Regexes are a very useful tool, but too often they are seen as the answer to every problem, which they're not.
Some people, when confronted with a problem, think “I know,
I'll use regular expressions.” Now they have two problems.
-- Jamie Zawinski
You want non-capturing groups (?:...), but here is another way.
/\{\{\{(.*?)\}\}\}/m
Just a shot:
/\{\{\{([\w\W]+?)\}\}\}/
I added non-greediness to your regex. This seems to work.

Resources