Algorithm to tell when we've processed a complex variable path expression while parsing?

I am working on a compiler for a homemade programming language, and I am stuck on how to convert the lexical token stream into a tree of commands for constructing a DOM-like tree. The "tree of commands" will still be a list, essentially emitting events that describe how to create a tree from the partial information provided by the lexer. (This language is like CoffeeScript in a way, indentation based, or like XML with an indentation focus.)
I am stuck on how to tell when a variable path has been discovered. A variable path can be simple, or complex, as these examples demonstrate:
foo
foo.bar
foo.bar[baz].hello[and][goodday].there
this[is[even[more.complicated].wouldnt.you[say]]]
They could get more complicated still, if we handled dynamic interpolation of strings, such as:
foo[`bar${x}abc`].baz
But in my simple lang there are two relevant things, "paths" and "terms". Terms are anything matching /a-z/ for now, and paths chain terms together and nest them, like the first examples.
For demonstration purposes, everything else is a simple "term" of 1 word, so you might have this:
abc foo.bar[baz].hello[and][goodday].there, one foo.bar
It forms a simple tree.
Right now I have a lexer which spits out the tokens, so basically:
abc
[SPACE]
foo
.
bar
[
baz
]
.
hello
[
and
]
[
goodday
]
.
there
,
[SPACE]
one
[SPACE]
foo
.
bar
That is at least how I broke it up initially.
So given that sequence of strings, how can you generate messages to tell the parser how to build a tree?
term
nest-down
term
period
term
open-square
term
close-square
...
That is the stream of tokens with a name now, but it is not a tree yet. I would like this:
term-start
term # value: abc
term-end
nest-down
term-path-start
term-start
term # value: foo
term-end
period
term-start
term # value: bar
term-end
term-nest-start
term-start
term # value: and
term-end
term-nest-end
...
I have been struggling with this example for several days now (boiled down from a complex real-world scenario). I can't seem to figure out how to keep track of all the information you need in order to decide "this structure is done now, close it out". Wondering if you know how to get past this.
Note, I don't need the last tree to actually be a tree structure visually, I just need it to generate those messages which can be interpreted on the other end and used to construct a tree at runtime.

There is no way to construct a tree from a list without having the description of the tree in some form. Often, in relation to parsing, the description of this tree is given by a context-free grammar (CFG).
You then create a parser based on this CFG. The lexical token stream is given as input to the parser, and the parser organizes the lexical tokens into a tree using some parsing algorithm.
The parser emits commands for syntax tree construction based on the rules it uses during parsing. On entering a rule, a "rule X enter" command is emitted; on exiting a rule, an "exit rule X" command is emitted. When a lexical token is accepted, a "token forward" command is emitted with its lexeme characters. Some grammars, namely those in ABNF format, support repetitions of elements; depending on these repetitions, the syntax tree might be represented with lists or arrays.
A builder module then receives these commands and builds a tree, or uses the commands directly for a specific task via the listener pattern.
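To make the idea concrete, here is a minimal sketch in Python (my illustration only, not the exact command set of any tool): a recursive descent parser over your token stream that assumes a hypothetical grammar path = term (('.' term) | ('[' path ']'))* and emits enter/exit/token events. The moment the while loop sees a token that is neither '.' nor '[', it knows the path is finished and "closes it out".

import re

def lex(src):
    # split into the same kinds of tokens your lexer produces: words, '.', '[', ']'
    return re.findall(r"[a-z]+|[.\[\]]", src)

class PathParser:
    def __init__(self, tokens):
        self.tokens, self.pos, self.events = tokens, 0, []

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        self.events.append(("token", tok))      # "token forward"
        return tok

    def enter(self, rule):
        self.events.append(("enter", rule))

    def exit(self, rule):
        self.events.append(("exit", rule))

    def path(self):
        # path = term (('.' term) | ('[' path ']'))*
        self.enter("path")
        self.term()
        while self.peek() in (".", "["):
            if self.eat() == ".":
                self.term()
            else:                               # consumed '[': nested path, then ']'
                self.path()
                self.eat()                      # assumes well-formed input: eats ']'
        self.exit("path")                       # nothing chains on, so the path is done

    def term(self):
        self.enter("term")
        self.eat()                              # the /[a-z]+/ word itself
        self.exit("term")

p = PathParser(lex("foo.bar[baz].hello[and][goodday].there"))
p.path()
print(p.events[:6])
# [('enter', 'path'), ('enter', 'term'), ('token', 'foo'), ('exit', 'term'), ('token', '.'), ('enter', 'term')]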
I have co-authored (2021) a paper describing a list of commands for building concrete/abstract syntax trees, depending on the CFG's structure, that are used in the parsers generated by the parser generator Tunnel Grammar Studio.
The paper is named "The Expressive Power of the Statically Typed Concrete Syntax Trees". It is in an open-access journal (intentionally). The commands are in section "4.3 Syntax Structure Construction Commands". The article is a bit "compressed" due to space limitations, and it is not really intended to be a software development guide but rather to document the approach taken. It might give you some ideas.
Another co-authored paper of mine from 2021, named "A Parsing Machine Architecture Encapsulating Different Parsing Approaches" (also in an open-access journal), describes a general form of parsing machine and its modules. Fig. 1 on p. 33 there will give you a quick overview.
Disclaimer: I have made the parser generator.

Related

Rascal: Grammar Stack Trace

When parsing a file with a specific grammar and the parse fails, I get a corresponding error message with the location in the source file that offended the grammar.
What I would like to look at in these situations would be the list of grammar rules that were active at this moment, something like a grammar rule "stack trace", or the rules that have matched so far.
Is this possible in Rascal?
So, for a very simple example, in the EXP language from the documentation, if I tried to parse "2 + foo" I could get something like
Exp
=> left Exp "+" Exp
=> left IntegerLiteral "+" Exp
=> left IntegerLiteral "+" <?>
No derivation of "foo" from rule 'Exp'
Another way of saying this is looking at an incomplete parse tree, as it was the moment the parse error occurred. Does that make sense?
It makes total sense, but I'm afraid this "incomplete parse tree" feature is on our TODO list.
Note that with the non-deterministic parsing algorithm it would probably return a set of current parse contexts, so a "parse forest" rather than a single stack trace. Still I think that would be a very useful debugging feature.
The only suggestion I can make right now is "delta debugging": removing half the input and checking whether the parse error is still there, then the other half, rinse/lather/repeat.
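A minimal sketch of that bisection idea in Python, assuming a hypothetical still_fails predicate that re-runs the parse on a candidate input and reports whether the error is still present (the full ddmin algorithm is smarter, but this captures the manual process):

def shrink(text, still_fails):
    # keep halving while one half alone still triggers the parse error
    while len(text) > 1:
        half = len(text) // 2
        first, second = text[:half], text[half:]
        if still_fails(first):
            text = first
        elif still_fails(second):
            text = second
        else:
            break   # the failure needs pieces of both halves; stop (or switch to ddmin)
    return text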

fuzzy logic for query-based document summarisation in Python

I am trying to use fuzzy logic to weight and extract the best sentences for the query. I have extracted the following features, which can be used in the fuzzy logic:
Each sentence has a cosine value.
How many proper nouns are in the sentence.
The position of the sentence in the document.
Sentence length.
I want to use the above features to apply the fuzzy logic. For instance, I want to create a rule base something like the following:
if cosineValue >= 0.9 && numberOfPropernoun >=1
THEN the sentence is important
I am not quite sure how to start implementing the rule base, the facts, and the inference engine. I would like someone to guide me in implementing this in Python. Please note that I am not familiar with logic programming languages; I would like to implement it in Python.
This is just a sketch; I'm not even going to try this code because I'm not sure what you want.
Make a class for your features:
from collections import namedtuple
Features = namedtuple('Features', ['cosine', 'nouns', 'position', ...])
etc.
Now imagine you are building your AST. What grammar does your language have? Well, you have conditions, and your conditions have consequences, and your conditions can be combined by boolean operators, so let's make some basic ones:
class CosineValue(object):
    def evaluate(self, features):
        return features.cosine

class Nouns(object):
    def evaluate(self, features):
        return features.nouns

... etc.
Now you need to combine these AST nodes with some operations
class GreaterThan(object):
    def __init__(self, property, value):
        self.property, self.value = property, value
    def evaluate(self, sentence):
        return self.property.evaluate(sentence) > self.value
Now GreaterThan(CosineValue(), 0.9) is an object (an abstract syntax tree, actually) that represents cosineValue > 0.9. You can evaluate it like so:
expr = GreaterThan(CosineValue(), 0.9)
expr.evaluate(Features(cosine=0.95, ...)) # returns True
expr.evaluate(Features(cosine=0.40, ...)) # returns False
These objects don't look like much, but what they are doing is reifying your process. Their structure encodes what formerly would have been code. Think about this, because this is the only hard part about what you are trying to do: comprehending how you can delay computation by turning it into structure, and how you can play with when values become part of your computation. You were probably stuck thinking about how to write those "if" statements and keeping them separate from the code and the runtime values you need to run them against. Now you should be able to see how, but it's a more advanced way of thinking about programming.
Now you need to build your if/then structure. I'm not sure what you want here either but I would say your if/then is going to be a class that takes an expression like we've just created as one argument and a "then" case, and does the test and either performs or does not perform the "then" case. Probably you will need if/then/else, or else a way to track if it fired, or a way to evaluate your if into a value. You will have to think about this part; nobody can tell you based on what you wrote above what you should be doing.
To make your conditions more powerful, you will need to add some more classes for boolean operators that take conditions as arguments, but it should be straightforward; you'll have And and Or, they'll both take two Condition arguments and their evaluation will do the sensible thing. You could make a Condition superclass, and then add some methods like And and Or to simplify generating these structures.
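For illustration only, one possible shape for those operator and if/then nodes, building on the sketch above (Rule, its apply method, and the exact handling of >= are my assumptions, not a prescription):

class And(object):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def evaluate(self, features):
        return self.left.evaluate(features) and self.right.evaluate(features)

class Or(object):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def evaluate(self, features):
        return self.left.evaluate(features) or self.right.evaluate(features)

class Rule(object):
    # if the condition holds for a sentence's features, run the "then" action
    def __init__(self, condition, then):
        self.condition, self.then = condition, then
    def apply(self, sentence, features):
        if self.condition.evaluate(features):
            self.then(sentence)

# "if cosineValue >= 0.9 && numberOfPropernoun >= 1 THEN the sentence is important"
# (GreaterThan is strict, so a true >= would need its own node or a small tweak)
important = []
rule = Rule(And(GreaterThan(CosineValue(), 0.9), GreaterThan(Nouns(), 0)),
            important.append)
# assuming Features was defined with exactly the fields cosine, nouns, position:
rule.apply("An example sentence.", Features(cosine=0.95, nouns=2, position=0))
print(important)   # ['An example sentence.']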
Finally, if you want to parse something like what you have above, you should try out pyparsing, but make sure you have the AST figured out first or it will be an uphill battle. Or look at what they have; maybe they have some primitives for this, I haven't dealt with pyparsing in a long time.
Best of luck, and please ask a better question next time!

Ontologies, OWL, Sparql: Modelling that "something is not there" and performance considerations

We want to model that "something is not there", as opposed to missing information; e.g., an explicit statement that "a patient did not get chemotherapy" or that "a patient does not have dyspnea" is different from missing information about whether a patient has dyspnea.
We thought about several approaches, e.g.
Using a negation class: "No_Dyspnea". But that seems semantically problematic, since what type would that class be? It cannot be a descendant of the "Dyspnea" class.
Using "not there" object properties, e.g. "denies" or "does_not_have" and then an individual of the Dyspnea root class as the object of that triple.
Using blank nodes that describe that the individual belongs to the group of things that do not have dyspnea. E.g.:
dat:PatientW2 a [ rdf:type owl:Class ;
      owl:complementOf [
          rdf:type owl:Restriction ;
          owl:onProperty roo:has_finding ;
          owl:someValuesFrom nci:Dyspnea ;
      ]
  ] .
We feel like the 3rd option is the most "ontologically correct" way of expressing this. However, when playing around with it we encountered severe performance problems in simple scenarios.
We are using Sesame with an OWLIM-Lite store and imported the NCI thesaurus (280MB, about 80,000 concepts) and another very small ontology into the store and added two individuals, one having that complementOf/restriction class.
The following query took forever to execute and I terminated it after 15 minutes:
select *
where {
    ?s a [ rdf:type owl:Class ;
           owl:complementOf [
               rdf:type owl:Restriction ;
               owl:onProperty roo:has_finding ;
               owl:someValuesFrom nci:Dyspnea ;
           ]
         ] .
} Limit 100
Does anybody know why? I would assume that this approach creates a lot of blank nodes and the query engine has to go through the entire NCI thesaurus and compare all blank nodes with this one?
If I put this triple in a separate graph and only query that graph, the query returns the result instantaneously.
To sum things up. The two basic questions are:
Is the third approach really the best for modelling "something is not there"
Is this going to affect query performance?
EDIT 1
We discussed the proposed options. It actually helped us in clarifying what we are really trying to achieve:
We want to be able to state that "Patient has Dyspnea" or "Patient does not have Dyspnea" at a particular point in time.
In the future there may/will be more information about that patient, e.g. that he/she now has dyspnea.
We want to be able to write Sparql queries that ask for "all patients that have dyspnea" and "all patients that do not have dyspnea".
We want to keep the Sparql as simple and intuitive as possible. E.g. only use one property "has_finding" rather than having to know about two properties (one for "has_exclusion"). Or having to know about some complex blank node construct.
We played around with options:
Negative Property Assertions: This sounded like the best solution to this problem, since we are stating that one individual is not related to another individual on that property. The issues are that we have to create an individual of Dyspnea for the sake of having something as owl:targetIndividual, and we cannot find a way of querying the negative assertion easily other than going through the whole owl:sourceIndividual and owl:targetIndividual chain, which makes the Sparql quite lengthy and puts a burden on the person writing the query to know about it.
Blank node with complementOf: We would be stating something with this that we do not want to state. This would state that "Patient1 can never have a finding of dyspnea", whereas we want to state that "Patient1 does not have a dyspnea finding now (or at date X)". So we should not use this approach.
Using Exclusion/Inclusion Types (Options 1 and 2): After a closer look at Jeen's suggestion, we believe that using general :Exclusion and :Inclusion classes, along with only one property has_finding, and giving the dyspnea individual the inclusion/exclusion type is the easiest to understand and query, and provides enough reasoning ability. Example:
:Patient1 a :Patient .
:Dyspnea1 a :Dyspnea .
:Dyspnea1 a :Exclusion.
:Patient1 ex:has_finding :Dyspnea1 .
That way, the person writing the Sparql query only has to know that:
There is one property, has_finding, which represents the intention properly, since "No dyspnea" is technically a finding as well.
But just querying on has_finding will not give sufficient information about whether the person actually has it or not. The query also needs to contain a triple about whether the dyspnea individual is an :Exclusion (or an :Inclusion, depending on the goal of the query).
While this puts some additional burden on the query writer, it is less than negative property assertions and easier to understand.
We would really appreciate some feedback on these conclusions!
If your diseases are represented as individuals, then you can use negative object property assertions to literally say, e.g.,
¬hasFinding(john,Dyspnea)
NegativeObjectPropertyAssertion(hasFinding john Dyspnea)
Of course, if you have lots of things that aren't the case, then this might get a bit involved. It's probably the most semantically correct, though. It also means that your query could match directly against the data in the ontology, which might make for quicker results. (Of course, you'd still have the issues of trying to infer when the negative object property holds.)
This doesn't work if diseases are represented as classes, though. If diseases are represented by classes, then you can use class expressions, similar to what you propose. E.g.,
(∀ hasFinding.¬Dyspnea)(john)
ClassAssertion(ObjectAllValuesFrom(hasFinding ObjectComplementOf(Dyspnea)) john)
This is similar to your third option, but I wonder if it might perform better. It seems like a slightly more direct way of saying what you're trying to say (i.e., if someone has a disease, it's not one of these diseases).
I do agree with Jeen's answer, though; there's a lot of subjectivity here, and a great deal of getting it "right" is actually just a matter of finding something that's reasonable to work with, performs well enough for you, and that seems not entirely unnatural.
With respect to the modeling question, I'd like to offer a fourth alternative, which is, in fact, a mix of your options 1 and 2: introduce a separate class (hierarchy) for these 'excluded/missing' symptoms, diseases or treatments, and have the specific exclusions as instances:
:Exclusion a owl:Class .
:ExcludedSymptom rdfs:subClassOf :Exclusion .
:ExcludedTreatment rdfs:subClassOf :Exclusion .
:excludedDyspnea a :ExcludedSymptom .
:excludedChemo a :ExcludedTreatment .
:Patient a owl:Class ;
    owl:equivalentClass [ a owl:Restriction ;
                          owl:onProperty :excluded ;
                          owl:allValuesFrom :Exclusion ] .
# john is a patient without Dyspnea
:john a :Patient ;
    :excluded :excludedDyspnea .
Optionally, you can link the exclusion instances semantically with the treatment/symptom/diseases:
:excludedDyspnea :ofSymptom :Dyspnea .
In my view, this is just as "ontologically correct" (this kind of thing is quite subjective to be honest) as your other options, and possibly a lot easier to maintain, query, and indeed reason with.
As for your second question: while I can't speak for the behavior of the particular reasoner you're using, in general any construction involving complementOf is computationally very heavy, but perhaps more importantly, it probably does not capture what you intend.
OWL has an open world assumption, which (in broad terms) means that we cannot decide a certain fact is untrue simply because that fact is currently unknown. Your complementOf construction will logically be an empty class, because for any individual X, even if we currently do not know that X has been diagnosed with Dyspnea, there is a possibility that in the future that fact will become known, and therefore X will not be in the complement class.
EDIT
In response to your edit, with the proposal using a single :hasFinding property, I think that generally looks good, though I would perhaps modify it slightly:
:patient1 a :Patient ;
    :hasFinding :dyspneaFinding1 .

:dyspneaFinding1 a :Finding ;
    :of :Dyspnea ;
    :conclusion false .
You have now separated the 'finding' as a concept a bit more cleanly from the symptom/treatment that it is a finding of. Also, whether the finding is positive or negative is explicitly modeled (rather than implied by the presence/absence of an 'excluded' property or an 'Exclusion' type).
(As an aside: since we link an individual with a class here via a non-typing relation (... :of :Dyspnea) we must rely on OWL 2 punning to make this valid in OWL DL)
To query for a patient with a finding (whether positive or negative) about Dyspnea:
SELECT ?x
WHERE {
    ?x a :Patient ;
       :hasFinding [ :of :Dyspnea ] .
}
And to query for patients with confirmed absence of Dyspnea:
SELECT ?x
WHERE {
    ?x a :Patient ;
       :hasFinding [ :of :Dyspnea ;
                     :conclusion false ] .
}

Pythonesque blocks and postfix expressions

In JavaScript,
f = function(x) {
    return x + 1;
}
(5)
seems at a glance as though it should assign f the successor function, but actually assigns the value 6, because the lambda expression followed by parentheses is interpreted by the parser as a postfix expression, specifically a function call. Fortunately this is easy to fix:
f = function(x) {
    return x + 1;
};
(5)
behaves as expected.
If Python allowed a block in a lambda expression, there would be a similar problem:
f = lambda(x):
    return x + 1
(5)
but this time we can't solve it the same way because there are no semicolons. In practice Python avoids the problem by not allowing multiline lambda expressions, but I'm working on a language with indentation-based syntax where I do want multiline lambda and other expressions, so I'm trying to figure out how to avoid having a block parse as the start of a postfix expression. Thus far I'm thinking maybe each level of the recursive descent parser should have a parameter along the lines of 'we have already eaten a block in this statement so don't do postfix'.
Are there any existing languages that encounter this problem, and how do they solve it if so?
Python has semicolons. This is perfectly valid (though ugly and not recommended) Python code: f = lambda(x): x + 1; (5).
There are many other problems with multi-line lambdas in otherwise standard Python syntax though. It is completely incompatible with how Python handles indentation (whitespace in general, actually) inside expressions - it doesn't, and that's the complete opposite of what you want. You should read the numerous python-ideas threads about multi-line lambdas. It's somewhere between very hard and impossible.
If you want arbitrarily complex compound statements inside lambdas you can't use the existing rules for multi-line expressions even if you made all statements expressions. You'd have to change the indentation handling (see the language reference for how it works right now) so that expressions can also contain blocks. This is hard to do without breaking perfectly fine Python code, and will certainly result in a language many Python programmers will consider worse in several regards: Harder to understand, more complex to implement, permits some stupid errors, etc.
Most languages don't solve this exact problem at all. Most candidates (Scala, Ruby, Lisps, and variants of these three) have explicit end-of-block tokens. I know of two languages that have the same problem, one of which (Haskell) has been mentioned by another answer. Coffeescript also uses indentation without end-of-block tokens. It parses the transliteration of your example correctly. However, I could not find any specification of how or why it does this (and I won't dig through the parser source code). Both differ significantly from Python in syntax as well as design philosophy, so their solution is of little (if any) use for Python.
In Haskell, there is an implicit semicolon whenever you start a line with the same indentation as a previous one, assuming the parser is in a layout-sensitive mode.
More specifically, after a token is encountered that signals the start of a (layout-sensitive) block, the indentation level of the first token of the first block item is remembered. Each line that is indented more continues the current block item; each line that is indented the same starts a new block item, and the first line that is indented less implies the closure of the block.
How your last example would be treated depends on whether the f = is a block item in some block or not. If it is, then there will be an implicit semicolon between the lambda expression and the (5), since the latter is indented the same as the former. If it is not, then the (5) will be treated as continuing whatever block item the f = is a part of, making it an argument to the lambda function.
The details are a bit messier than this; look at the Haskell 2010 report.
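As a toy sketch of that rule in Python (my illustration, not the algorithm from the report; it tracks only a single block level):

def layout(lines):
    indent = lambda s: len(s) - len(s.lstrip())
    block_col, out = None, []
    for line in lines:
        if not line.strip():
            continue                      # blank lines do not affect layout here
        col = indent(line)
        if block_col is None:
            block_col = col               # first item fixes the block's column
            out.append(line)
        elif col > block_col:
            out.append(line)              # deeper: continues the current item
        elif col == block_col:
            out.append(";")               # same column: implicit ';' starts a new item
            out.append(line)
        else:
            out.append("}")               # shallower: implicit '}' closes the block
            out.append(line)
            block_col = None
    return out

print(layout(["f = lambda(x):", "        return x + 1", "(5)"]))
# ['f = lambda(x):', '        return x + 1', ';', '(5)']
# the same-column '(5)' becomes a new item rather than a call on the lambda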

Treetop grammar infinite loop

I have had some ideas for a new programming language floating around in my head, so I thought I'd take a shot at implementing it. A friend suggested I try using Treetop (the Ruby gem) to create a parser. Treetop's documentation is sparse, and I've never done this sort of thing before.
My parser is acting like it has an infinite loop in it, but with no stack traces; it is proving difficult to track down. Can somebody point me in the direction of an entry-level parsing/AST guide? I really need something that lists rules, common usage, etc. for using tools like Treetop. My parser grammar is on GitHub, in case someone wishes to help me improve it.
class {
  initialize = lambda (name) {
    receiver.name = name
  }
  greet = lambda {
    IO.puts("Hello, #{receiver.name}!")
  }
}.new(:World).greet()
I asked treetop to compile your language into an .rb file. That gave me something to dig into:
$ tt -o /tmp/rip.rb /tmp/rip.treetop
Then I used this little stub to recreate the loop:
require 'treetop'
load '/tmp/rip.rb'
RipParser.new.parse('')
This hangs. Now, isn't that interesting! An empty string reproduces the behavior just as well as the dozen-or-so-line example in your question.
To find out where it's hanging, I used an Emacs keyboard macro to edit rip.rb, adding a debug statement to the entry of each method. For example:
def _nt_root
  p [__LINE__, '_nt_root'] #DEBUG
  start_index = index
Now we can see the scope of the loop:
[16, "root"]
[21, "_nt_root"]
[57, "_nt_statement"]
...
[3293, "_nt_eol"]
[3335, "_nt_semicolon"]
[3204, "_nt_comment"]
[57, "_nt_statement"]
[57, "_nt_statement"]
[57, "_nt_statement"]
...
Further debugging from there reveals that an integer is allowed to be an empty string:
rule integer
  digit*
end
This indirectly allows a statement to be an empty string, and the top-level rule statement* to forever consume empty statements. Changing * to + fixes the loop, but reveals another problem:
/tmp/rip.rb:777:in `_nt_object': stack level too deep (SystemStackError)
from /tmp/rip.rb:757:in `_nt_compound_object'
from /tmp/rip.rb:1726:in `_nt_range'
from /tmp/rip.rb:1671:in `_nt_special_literals'
from /tmp/rip.rb:825:in `_nt_literal_object'
from /tmp/rip.rb:787:in `_nt_object'
from /tmp/rip.rb:757:in `_nt_compound_object'
from /tmp/rip.rb:1726:in `_nt_range'
from /tmp/rip.rb:1671:in `_nt_special_literals'
... 3283 levels...
Range is left-recursing, indirectly, via special_literals, literal_object, object, and compound_object. Treetop, when faced with left recursion, eats stack until it pukes. I don't have a quick fix for that problem, but at least you've got a stack trace to work from now.
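For anyone unfamiliar with the failure mode, here is a stripped-down Python sketch of the same shape of indirect left recursion (rule names borrowed from the trace, everything else omitted): each rule calls back into the first one without consuming any input, so the recursion never bottoms out.

def parse_object(tokens, pos):
    return parse_compound_object(tokens, pos)   # object -> compound_object ...

def parse_compound_object(tokens, pos):
    return parse_range(tokens, pos)             # compound_object -> range ...

def parse_range(tokens, pos):
    # range begins (indirectly) with object again, at the very same position
    return parse_object(tokens, pos)

# parse_object([], 0)   # uncommenting this raises RecursionError, the Python
#                       # analogue of Treetop's SystemStackError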
Also, this is not your immediate problem, but the definition of digit is odd: it can be either one digit or multiple. This causes digit* or digit+ to allow the (presumably) illegal integer 1________2.
I really enjoyed Language Implementation Patterns by Parr; since Parr created the ANTLR parser generator, it's the tool he uses throughout the book, but it should be simple enough to learn from it all the same.
What I really liked about it was the way each example grew upon the previous one; he doesn't start out with a gigantic AST-capable parser. Instead, he slowly introduces problems that need more and more 'backend smarts' to do the job, so the book scales well along with the language that needs parsing.
What I wish it covered in a little more depth is the types of languages one can write, with advice on do's and don'ts when designing languages. I've seen some languages that are a huge pain to parse, and I'd have liked to know more about the design decisions that could have been made differently.
