First of all, kudos to Angel Chang for writing such a great tool as TokensRegex!
My use case is the following:
I have two extraction rules in my test rule set. Both of them specify the "action" field as the outcome, and both have "Annotate" in the action list.
They work just fine when the second rule's expression to match is independent of the first rule's outcome. But when the second rule execution depends on the outcome of the first rule, things break down.
A specific example:
I have the following sentence: "The consensus estimates called for EPS of $3.55 on revenues of $30.51 billion."
"EPS" and "revenues" are already annotated by a more basic RegexNER annotator. The goal of TokensRegex annotator is to augment the NER annotations if certain conditions are met.
In this simplified example if we see a term "estimate(s)" occuring shortly before the term "EPS", we want to re-tag the token "EPS" with the "DN-EPS_EST" NER annotation. That's my first rule.
The second rule is dependent on the result of the first rule - re-annotate the token "revenues" if it is preceeded by a token with the NER annotation of "DN-EPS_EST" (which can only be the outcome of the first rule).
So my TokensRegex rules are the following:
{
ruleType: "tokens",
pattern: ( /[Ee]stimates?/ []{0,3} [{ner:"DN-EPS"}] ),
action: ( Annotate($0[-1], "ner", "DN-EPS_EST") ) }
{
ruleType: "tokens",
pattern: ( [{ner:"DN-EPS_EST"}] /of/ [{ner:"MONEY"}]{1,3} /on/ [{ner:"DN-REVENUE"}] ),
action: ( Annotate($0[-1], "ner", "DN-REVENUE_EST") ) }
The first rule works, but the second doesn't. What could the problem be? Are the rules executed in the wrong order? Are the results of the first rule not persisted in time to be matched by the second expression? Am I using the wrong fields or action type? I intentionally simplified the pattern-matching expressions in this example, but maybe I still have an error in the "pattern" field of the second rule?
Any help at all will be greatly appreciated! I am stumped. I've read all the documentation on the website, the Javadocs, and even the slides, but just can't find a specific answer.
OK, after some additional tinkering and research I finally found the answer to my own question:
You have to apply the chained rules in stages, just ordering them "correctly" in the rules file is not sufficient.
TokensRegexAnnotator will do NOTHING with the dependent rule if its pattern mentions a token property that is modified by the upstream rule and the stage is the same (or unspecified). It matches neither the "before the 1st rule execution" state nor the "after the 1st rule execution" state.
I tested the 2nd rule by itself by taking the 1st rule out of the equation altogether - it worked. This was necessary to ensure that the pattern expression was not faulty in the 2nd rule.
Then I re-introduced the 1st rule and tested the 2nd rule with two expressions: the "before the 1st rule execution" state and the "after the 1st rule execution" state - nothing happened in either case. Not sure why TokensRegexAnnotator was implemented this way; maybe the creators thought that no behavior is better than some default behavior...
At any rate, only after reading deeper into the "SequenceMatchRules" Javadoc did I find the "stage" field and attempt to apply it (although it does not say explicitly that you HAVE to apply it if a rule uses output annotations from some other rule).
Here's what the working example looks like:
{ ruleType: "tokens",
pattern: ( /[Ee]stimates?/ []{0,3} [{ner:"DN-EPS"}] ),
action: ( Annotate($0[-1], "ner", "DN-EPS_EST") ),
stage: 1 }
{ ruleType: "tokens",
pattern: ( [{ner:"DN-EPS_EST"}] /of/ [{ner:"MONEY"}]{1,3} /on/ [{ner:"DN-REVENUE"}] ),
action: ( Annotate($0[-1], "ner", "DN-REVENUE_EST") ),
stage: 2 }
As you can see, the 2nd rule's pattern has a condition on an NER annotation that can be satisfied only after the 1st rule is executed and results are committed. In this example the 2nd rule is fired, as expected.
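For completeness, wiring such a staged rules file into a CoreNLP pipeline can look roughly like the sketch below. The rules file name "eps_rules.txt" and the exact annotator list are placeholders, and the RegexNER setup that produces the base "DN-EPS"/"DN-REVENUE" tags is omitted here:

import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class TokensRegexStagesDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // run the TokensRegexAnnotator after the regular NER annotators
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        // the rules file containing both the stage 1 and stage 2 rules above
        props.setProperty("tokensregex.rules", "eps_rules.txt");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation(
            "The consensus estimates called for EPS of $3.55 on revenues of $30.51 billion.");
        pipeline.annotate(doc);
    }
}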
I'm new to Neo4j and I have the following query. How can I make it faster? It really takes a long time.
MATCH (vintage:Vintage)-[:MADE_FROM]->(wine:Wine)
OPTIONAL MATCH (vintage)-[:DESIGNATED_BY]->(app:Appellation)
OPTIONAL MATCH (vintage)-[:RANKED]->(ranking:Ranking)
OPTIONAL MATCH (vintage)-[:HAS_NOTE]->(note:Note)<-[:REVIEWS]-(reviewer:Reviewer)
WITH reviewer, note, app, wine, vintage ORDER BY vintage.code ASC, vintage.year DESC
RETURN { vintages: collect({ uid: vintage.uid, year: vintage.year,
cv: vintage.referencePrice})[10 * (1 - 1)..10 * 1], total: size(collect(vintage)) } as vintage
It's a little strange that you're getting all this optional info (app, ranking, note, reviewer) but you're not doing anything with it. You're not returning any of that; you're only returning information from the vintage.
If you're not going to use it, don't match to it.
But if your intent is to make sure at least one such path exists per pattern, use a WHERE clause for these; that will also avoid cross-product issues when multiple paths exist for some of the OPTIONAL MATCHes.
But since you're using OPTIONAL MATCHes, it seems like you don't want their presence or absence in the graph to control whether or not the vintage is returned. As such, they aren't serving any purpose, and are only making your query slower. The only hard requirement you seem to have is that the :Vintage is made from a :Wine (though if all :Vintages are made from :Wine, you can remove that part too). We'll keep that and remove the others.
Also, your calculation for the total vintages isn't correct: the OPTIONAL MATCHes can produce multiple rows per vintage, which inflates the count. And since you're using OPTIONAL MATCHes, you don't care about a vintage meeting those patterns, so we can just total the vintages before matching to the individual nodes.
Lastly, it looks like you're only after a subset (page) of vintages, so we can do away with the list-slice approach and use SKIP and LIMIT instead. This is also probably why your query is taking so long: it's calculating all vintage information, but you're only taking a slice of it, so you're doing a lot of work that is thrown away at the end.
Try this instead:
MATCH (vintage:Vintage)
WHERE (vintage)-[:MADE_FROM]->(:Wine)
WITH count(vintage) as total
MATCH (vintage:Vintage)
WHERE (vintage)-[:MADE_FROM]->(:Wine)
WITH total, vintage
ORDER BY vintage.code ASC, vintage.year DESC
SKIP 0
LIMIT 10
WITH total, collect({ uid: vintage.uid, year: vintage.year, cv: vintage.referencePrice}) as vintages
RETURN { vintages:vintages, total:total} as vintage
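If you want to page through the results, SKIP and LIMIT also accept parameters, so (assuming parameter names $skip and $limit, and noting that older Neo4j versions use the {param} syntax instead) the paging lines above would simply become:

SKIP $skip
LIMIT $limit

Everything else in the query stays the same.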
If there was a purpose to those OPTIONAL MATCHes, please explain in comments, and I'll see how the query can be modified to handle them.
I often find when adding rules to my workflow that I need to split large jobs up into batches. This means that my input/output files will branch out across temporary sets of batches for some rules before consolidating again into one input file for a later rule. For example:
rule all:
input:
expand("final_output/{sample}.counts",sample=config["samples"]) ##this final output relates to blast rule in that it will feature a column defining transcript type
...
rule batch_prep:
input: "transcriptome.fasta"
output:expand("blast_input_{X}.fasta",X=[1,2,3,4,5])
script:"scripts/split_transcriptome.sh"
rule blast:
input:"blast_input_{X}.fasta",
output:"output_blast.txt"
script:"scripts/blastx.sh"
...
rule rsem:
input:
"transcriptome.fasta",
"{sample}.fastq"
output:
"final_output/{sample}.counts"
script:
"scripts/rsem.sh"
In this simplified workflow, snakemake -n would show a separate rsem job for each sample (as expected, from wildcards set in rule all). However, blast would give a WildcardError stating that
Wildcards in input files cannot be determined from output files:
'X'
This makes sense, but I can't figure out a way for the Snakefile to submit separate jobs for each of the 5 batches above using the one blast template rule. I can't make separate rules for each batch, as the number of batches will vary with the size of the dataset. It seems it would be useful if I could define wildcards local to a rule. Does such a thing exist, or is there a better way to solve this issue?
I hope I understood your problem correctly, if not, feel free to correct me:
So, you want to call the rule blast for every "blast_input_{X}.fasta"?
Then, the batch wildcard would need to be carried over into the output.
rule blast:
input:"blast_input_{X}.fasta",
output:"output_blast_{X}.txt"
script:"scripts/blastx.sh"
If you then later want to merge the batches again in another rule, just use expand in the input of that rule.
input: expand("output_blast_{X}.txt", X=your_batches)
output: "merged_blast_output.txt"
While working with an ANTLR3 grammar, I have come across a situation where two rules with effectively the same intent behave differently.
I have created a small example-
I want to parse a qualified object name which may be 3-part or 2-part or unqualified (Dot is the separator).
Test Input-
1. SCH.LIB.TAB1;
2. LIB.TAB1;
3. TAB1;
I changed the below rule from having optionals to having alternatives (ORed rules).
Before State-
qualified_object_name
:
( identifier ( ( DOT identifier )? DOT identifier )? )
;
After State-
qualified_object_name_new
:
( identifier DOT identifier DOT identifier ) // 3 part name
| ( identifier DOT identifier ) // 2 part name
| ( identifier ) // 1 part name
;
Input 1 is parsed correctly by both rules, but the new rule gives an error while parsing inputs 2 and 3.
line 1:22 no viable alternative at input ';'
I assumed that ANTLR would try to match against alternative 1 of qualified_object_name_new, and when alternative 1 does not match fully, it would try to match alternative 2, and so on.
So, for the input 'LIB.TAB1' it would finally match against alternative 2 of qualified_object_name_new.
However, it does not work this way and gives an error while parsing a 2-part name or an unqualified name.
Interestingly, when I set the option k = 1, all 3 inputs are parsed correctly by the new rule.
But with any other value of k, it gives an error.
I want to understand why ANTLR behaves this way and whether this is correct.
You probably have not increased the lookahead size (which is 1 by default in ANTLR3). With one token of lookahead the new object name rule cannot resolve the ambiguity (all alts start with the same token). You should have gotten a warning about this too.
You have 3 options to solve this problem with ANTLR3 (though I also recommend to switch to version 4):
Enable backtracking (see the backtrack option), though I'm not 100% sure if that really helps.
Increase lookahead size (see the k option).
Use syntactic predicates to force ANTLR to look ahead over the entire alt.
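For example, option 3 applied to the rule from the question could look roughly like this (an untested sketch, not verified against your full grammar):

qualified_object_name_new
  :
    ( identifier DOT identifier DOT identifier )=> identifier DOT identifier DOT identifier // 3 part name
  | ( identifier DOT identifier )=> identifier DOT identifier                               // 2 part name
  | identifier                                                                              // 1 part name
  ;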
For more details read the ANTLR3 documentation.
Is it possible to extract the first and follow sets from a rule using ANTLR4? I played around with this a little bit in ANTLR3 and did not find a satisfactory solution, but if anyone has info for either version, it would be appreciated.
I would like to parse user input up to the user's cursor location and then provide a list of possible choices for auto-completion. At the moment, I am not interested in auto-completing tokens which are partially entered. I want to display all possible following tokens at some point mid-parse.
For example:
sentence:
subjects verb (adverb)? '.' ;
subjects:
firstSubject (otherSubjects)* ;
firstSubject:
'The' (adjective)? noun ;
otherSubjects:
'and the' (adjective)? noun;
adjective:
'small' | 'orange' ;
noun:
CAT | DOG ;
verb:
'slept' | 'ate' | 'walked' ;
adverb:
'quietly' | 'noisily' ;
CAT : 'cat';
DOG : 'dog';
Given the grammar above...
If the user had not typed anything yet the auto-complete list would be ['The'] (Note that I would have to retrieve the FIRST and not the FOLLOW of rule sentence, since the follow of the base rule is always EOF).
If the input was "The", the auto-complete list would be ['small', 'orange', 'cat', 'dog'].
If the input was "The cat slept, the auto-complete list would be ['quietly', 'noisily', '.'].
So ANTLR3 provides a way to get the set of follows doing this:
BitSet followSet = state.following[state._fsp];
This works well. I can embed some logic into my parser so that when the parser calls the rule at which the user is positioned, it retrieves the follows of that rule and then provides them to the user. However, this does not work as well for nested rules (for instance, the base rule), because the follow set ignores any sub-rule follows, as it should.
I think I need to provide the FIRST set if the user has completed a rule (this could be hard to determine) as well as the FOLLOW set, to cover all valid options. I also think I will need to structure my grammar such that two tokens are never subsequent at the rule level.
I would have to break the above "firstSubject" rule into some sub-rules...
from
firstSubject:
'The'(adjective)? CAT | DOG;
to
firstSubject:
the (adjective)? CAT | DOG;
the:
'the';
I have yet to find any information on retrieving the FIRST set from a rule.
ANTLR4 appears to have drastically changed the way it works with follows at the level of the generated parser, so at this point I'm not really sure if I should continue with ANTLR3 or make the jump to ANTLR4.
Any suggestions would be greatly appreciated.
ANTLRWorks 2 (AW2) performs a similar operation, which I'll describe here. If you reference the source code for AW2, keep in mind that it is only released under an LGPL license.
Create a special token which represents the location of interest for code completion.
In some ways, this token behaves like the EOF. In particular, the ParserATNSimulator never consumes this token; a decision is always made at or before it is reached.
In other ways, this token is very unique. In particular, if the token is located at an identifier or keyword, it is treated as though the token type was "fuzzy", and allowed to match any identifier or keyword for the language. For ANTLR 4 grammars, if the caret token is located at a location where the user has typed g, the parser will allow that token to match a rule name or the keyword grammar.
Create a specialized ATN interpreter that can return all possible parse trees which lead to the caret token, without looking past the caret for any decision, and without constraining the exact token type of the caret token.
For each possible parse tree, evaluate your code completion in the context of whatever the caret token matched in a parser rule.
The union of all the results found in step 3 is a superset of the complete set of valid code completion results, and can be presented in the IDE.
The following describes AW2's implementation of the above steps.
In AW2, this is the CaretToken, and it always has the token type CARET_TOKEN_TYPE.
In AW2, this specialized operation is represented by the ForestParser<TParser> interface, with most of the reusable implementation in AbstractForestParser<TParser> and specialized for parsing ANTLR 4 grammars for code completion in GrammarForestParser.
In AW2, this analysis is performed primarily by GrammarCompletionQuery.TaskImpl.runImpl(BaseDocument).
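As a much simpler (and cruder) alternative to AW2's forest parser, the stock ANTLR 4 runtime can at least tell you which token types the parser would accept at the point where it runs out of input. A hedged sketch, where MyLexer/MyParser stand in for whatever classes ANTLR would generate from the grammar in the question:

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.IntervalSet;

public class ExpectedTokensSketch {
    // Parse everything the user has typed so far and capture the token types
    // the parser expected at the first point it got stuck (usually the caret).
    public static IntervalSet expectedAtCaret(String textUpToCaret) {
        MyLexer lexer = new MyLexer(CharStreams.fromString(textUpToCaret));
        MyParser parser = new MyParser(new CommonTokenStream(lexer));
        parser.removeErrorListeners(); // keep the console quiet
        final IntervalSet[] expected = new IntervalSet[1];
        parser.setErrorHandler(new DefaultErrorStrategy() {
            @Override
            public void reportError(Parser recognizer, RecognitionException e) {
                if (expected[0] == null) {
                    // token types the parser was willing to accept where it stopped
                    expected[0] = recognizer.getExpectedTokens();
                }
            }
        });
        parser.sentence(); // start rule from the grammar above
        return expected[0]; // null if the partial input happened to parse cleanly
    }
}

This only yields token types (no fuzzy identifier handling and no alternative parse trees), so it falls well short of what AW2 does, but it can be a useful starting point.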
I have a table containing millions of transaction records (Obj1), which looks like this:
TransactionNum   Country   ZipCode   State   TransactionAmount
1                USA       94002     CA      1000
2                USA       00023     FL      1000
I have another table containing salesrep records (Obj2), again in the hundreds of thousands.
SalesrepId   PersonNumber   Name
Srp001       123            Rohan
Srp002       124            Shetty
I have a few ruleset tables, where rules are basically defined as below:
Rule Name : Rule 1
Qualifying criteria : Country = "USA" and (ZipCode = 94002 or State = "FL")
Credit receiving salesreps :
Srp001 gets 70%
Srp002 gets 30%
The qualifying criteria apply to the transactions: if a transaction's attributes match the criteria in the rule, then credits are assigned to the salesreps defined in the rule's credit-receiver section.
Now, I need an algorithm that populates a result table as below:
ResultId   TransactionNumber   SalesrepId   Credit
1          1                   Srp001       700
2          2                   Srp002       300
What is an efficient algorithm to do this?
So your real problem is how to quickly match transactions to potential rules. You can do this with an inverted index that says which rules match particular values for the attributes. For example, let's say you have these three rules:
Rule 1: if Country = "USA" and State = "FL"
S1 gets 100%
Rule 2: if Country = "USA" and (State = "CO" or ZIP = 78640)
S2 gets 60%
S3 gets 40%
Rule 3: if Country = "UK"
S3 gets 70%
S2 gets 30%
Now, you process your rules and create output like this:
Country,USA,Rule1
State,FL,Rule1
Country,USA,Rule2
State,CO,Rule2
ZIP,78640,Rule2
Country,UK,Rule3
You then process that output (or you can do it while you're processing the rules) and build three tables. One maps Country values to rules, one maps State values to rules, and one maps ZIP values to rules. You end up with something like:
Countries:
USA, {Rule1, Rule2}
UK, {Rule3}
States:
FL, {Rule1}
CO, {Rule2}
"*", {Rule3}
ZIP:
78640, {Rule2}
"*", {Rule1, Rule3}
The "*" value is a "don't care," which will match all rules that don't specifically mention that field. Whether this is required depends on how you've structured your rules.
The above indexes are constructed whenever your rules change. With 4000 rules, it shouldn't take any time at all, and the list size shouldn't be very large.
Now, given a transaction that has a Country value of "USA", you can look in the Countries table to find all the rules that mention that country. Call that list Country_Rules. Do the same thing for States and ZIP codes.
You can then do a list intersection. That is, build another list called Country_And_State_Rules that contains only those rules that exist in both the Country_Rules and State_Rules lists. That will typically be a small set of possible rules. You could then go through them one-by-one, testing country, state, and ZIP code, as required.
What you're building is essentially a search engine for rules. It should allow you to narrow the candidates from 4,000 to just a handful very quickly.
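A rough sketch of those per-attribute indexes and the intersection step (illustrative class and field names; ORed conditions and the final verification against the full rule are only noted in comments, not implemented):

import java.util.*;

// Inverted index from attribute values to the rules that mention them.
// "*" is the "don't care" bucket for rules that don't constrain a field.
class RuleIndex {
    private final Map<String, Set<String>> byCountry = new HashMap<>();
    private final Map<String, Set<String>> byState   = new HashMap<>();
    private final Map<String, Set<String>> byZip     = new HashMap<>();

    // Rebuild whenever the rules change; null means "this rule doesn't care".
    // This simple signature treats all supplied fields as ANDed; a rule with
    // ORed branches would be added once per branch.
    void addRule(String ruleId, String country, String state, String zip) {
        byCountry.computeIfAbsent(country == null ? "*" : country, k -> new HashSet<>()).add(ruleId);
        byState.computeIfAbsent(state == null ? "*" : state, k -> new HashSet<>()).add(ruleId);
        byZip.computeIfAbsent(zip == null ? "*" : zip, k -> new HashSet<>()).add(ruleId);
    }

    private static Set<String> lookup(Map<String, Set<String>> index, String value) {
        Set<String> hits = new HashSet<>(index.getOrDefault(value, Collections.emptySet()));
        hits.addAll(index.getOrDefault("*", Collections.emptySet())); // rules that don't care
        return hits;
    }

    // Candidate rules for one transaction: intersect the per-attribute hit lists.
    Set<String> candidates(String country, String state, String zip) {
        Set<String> result = lookup(byCountry, country);
        result.retainAll(lookup(byState, state));
        result.retainAll(lookup(byZip, zip));
        return result;
    }
}

Each candidate returned by candidates(...) still has to be checked against the rule's full qualifying criteria before credits are computed from the percentages stored with the rule.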
There are a few problems that you'll have to solve. Having conditional logic ("OR") complicates things a little bit, but it's not intractable. Also, you have to determine how to handle ambiguity (what if two rules match?). Or, if no rules match the particular Country and State, then you have to back up and check for rules that only match the Country ... or only match the State. That's where the "don't care" comes in.
If your rules are sufficiently unambiguous, then in the vast majority of cases you should be able to pick the relevant rule very quickly. Some few cases will require you to search many different rules for some transactions. But those cases should be pretty rare. If they're frequent, then you need to consider re-examining your rule set.
Once you know which rule applies to a particular transaction, you can easily look up which salesperson gets how much, since the proportions are stored with the rules.