Early exit for the rules - stanford-nlp

I am using stanford nlp library.
I have a file which contains 1000 of rules.
As of now for a given text it goes through the all the rules.
So, lets say I have Rule 1 .. Rule 100. And for a given input Rule 2 was matched. The search does not stop here. It looks for all the rules.
Is there any way, if a rule is matched, then it stops the search instead of looking at all the rules always?

This is typically done by using a break statement, so you'd have, in pseudocode:
rule = 0;
while (ruleNotFound) {
if (rule > totalNumRules)
{
// searched all rules but could not find it
return false;
}
else if (checkRule(rule))
{
// found
ruleNotFound = false;
theRule = rule;
}
rule++;
}

Related

What is the most efficient way to replace a list of words without touching html attributes?

I absolutely disagree that this question is a duplicate! I am asking for an efficiency way to replace hundreds of words at once. This is an algorithm question! All the provided links are about to replace one word. Should I repeat that expensive operation hundreds of times? I'm sure that there are better ways as a suffix tree where I sort out html while building that tree. I removed that regex tag since for no good reason you are focusing on that part.
I want to translate a given set of words (more then 100) and translate them. My first idea was to use a simple regular expression that works better then expected. As sample:
const input = "I like strawberry cheese cake with apple juice"
const en2de = {
apple: "Apfel",
strawberry: "Erdbeere",
banana: "Banane",
/* ... */}
input.replace(new RegExp(Object.keys(en2de).join("|"), "gi"), match => en2de[match.toLowerCase()])
This works fine on the first view. However it become strange if you words which contains each other like "pineapple" that would return "pineApfel" which is totally nonsense. So I was thinking about checking word boundaries and similar things. While playing around I created this test case:
Apple is a company
That created the output:
Apfel is a company.
The translation is wrong, which is somehow tolerable, but the link is broken. That must not happen.
So I was thinking about extend the regex to check if there is a quote before. I know well that html parsing with regex is a bad idea, but I thought that this should work anyway. In the end I gave up and was looking for solutions of other devs and found on Stack Overflow a couple of questions, all without answers, so it seems to be a hard problem (or my search skills are bad).
So I went two steps back and was thinking to implement that myself with a parser or something like that. However since I have multiple inputs and I need to ignore the case I was thinking what the best way is.
Right now I think to build a dictionary with pointers to the position of the words. I would store the dict in lower case so that should be fast, I could also skip all words with the wrong prefix etc to get my matches. In the end I would replace the words from the end to the beginning to avoid breaking the indices. But is that way efficiency? Is there a better way to achieve that?
While my sample is in JavaScript the solution must not be in JS as long the solution doesn't include dozens of dependencies which cannot be translated easy to JS.
TL;DR:
I want to replace multiple words by other words in a case insensitive way without breaking html.
You may try a treeWalker and replace the text inplace.
To find words you may tokenize your text, lower case your words and map them.
const mapText = (dic, s) => {
return s.replace(/[a-zA-Z-_]+/g, w => {
return dic[w.toLowerCase()] || w
})
}
const dic = {
'grodzi': 'grodzila',
'laaaa': 'forever',
}
const treeWalker = document.createTreeWalker(
document.body,
NodeFilter.SHOW_TEXT
)
// skip body node
let currentNode = treeWalker.nextNode()
while(currentNode) {
const newS = mapText(dic, currentNode.data)
currentNode.data = newS
currentNode = treeWalker.nextNode()
}
p {background:#eeeeee;}
<p>
grodzi
LAAAA
</p>
The link stay untouched.
However mapping each word in an other language is bound to fail (be it missing representation of some word, humour/irony, or simply grammar construct). For this matter (which is a hard problem on its own) you may rely on some tools to translate data for you (neural networks, api(s), ...)
Here is my current work in progress solution of a suffix tree (or at least how I interpreted it). I'm building a dictionary with all words, which are not inside of a tag, with their position. After sorting the dict I replace them all. This works for me without handling html at all.
function suffixTree(input) {
const dict = new Map()
let start = 0
let insideTag = false
// define word borders
const borders = ' .,<"\'(){}\r\n\t'.split('')
// build dictionary
for (let end = 0; end <= input.length; end++) {
const c = input[end]
if (c === '<') insideTag = true
if (c === '>') {
insideTag = false
start = end + 1
continue
}
if (insideTag && c !== '<') continue
if (borders.indexOf(c) >= 0) {
if(start !== end) {
const word = input.substring(start, end).toLowerCase()
const entry = dict.get(word) || []
// save the word and its position in an array so when the word is
// used multiple times that we can use this list
entry.push([start, end])
dict.set(word, entry)
}
start = end + 1
}
}
// last word handling
const word = input.substring(start).toLowerCase()
const entry = dict.get(word) || []
entry.push([start, input.length])
dict.set(word, entry)
// create a list of replace operations, we would break the
// indices if we do that directly
const words = Object.keys(en2de)
const replacements = []
words.forEach(word => {
(dict.get(word) || []).forEach(match => {
// [0] is start, [1] is end, [2] is the replacement
replacements.push([match[0], match[1], en2de[word]])
})
})
// converting the input to a char array and replacing the found ranges.
// beginning from the end and replace the ranges with the replacement
let output = [...input]
replacements.sort((a, b) => b[0] - a[0])
replacements.forEach(match => {
output.splice(match[0], match[1] - match[0], match[2])
})
return output.join('')
}
Feel free to leave a comment how this can be improved.

Inline if with else

I realize that it's not valid ruby but what would be the technical hurdles to implement the below functionality into the Ruby core language (of say v2.3)?
variable = 1 if condition else -1
I'd also like to allow the following for more generic use.
variable = { 1 } if condition else { -1 }
I'm very open to requiring an "end" at the end.
I get that a ternary can easily accomplish this but I'm looking for a more readable inline-if that allows an else.
I get that I can make a function which does this via any number of styles but I'd prefer to have it as readable as possible.
Thanks.
EDIT: I hate editing questions for obvious reasons.
In response to the question of how the generic option is more ruby-esque, see the below example (I needed newlines).
variable = {
operation_one
operation_two
...
SUCCESS_STATUS_CODE
} if loaded_dependencies else {
do_detailed_logging
FAILURE_STATUS_CODE
}
if variable then
it_worked
else
explain_why
end
Because your example, while it seems readable to you, has too many ambiguities in other cases.
Not to mention that ruby has a way to do this, and it's the ternary operator. To say that your example is more ruby-like, is almost like wondering why the wheelbase of the Ford Mustang wasn't longer, and that it would be more "Mustang-like" if it was.
But here are some issues with your proposal, starting from your example:
variable = { 1 } if condition else { -1 }
Here you've given your "if else" bit a lower precedence than the "=".
In other words:
variable = ({ 1 } if condition else { -1 })
That's a serious problem, because it breaks the currently allowed:
variable = 1 if condition
The precedence for that statement is:
(variable = 1) if condition
And that's important. No assignment happens if the condition is false.
This can be a really big deal, for example if the lvalue (left side) actually has side-effects. For example:
var[0] = 1 if condition
The lookup for "var[0]" is a method in whatever class object var is, and while [] doesn't usually have side-effects, it can - and now you are going to do those side effects even if the condition is false.
And I won't even get into:
variable = { 1 } if condition if condition2 else { -1 }
But if you don't like it, you can always write your own language and see what happens!

Huge number of calls to ANTLR38BitConsume library function

I'm writing a lexer for ASN1 using ANTLR3 for C target, using the version 3.4 of the library. Actually the lexer is really slow, so I performed a perf record on the execution, and I found that the bottleneck is the library function ANTLR38BitConsume, which is pretty simple.
static void
antlr38BitConsume(pANTLR3_INT_STREAM is)
{
pANTLR3_INPUT_STREAM input;
input = ((pANTLR3_INPUT_STREAM) (is->super));
if ((pANTLR3_UINT8)(input->nextChar) < (((pANTLR3_UINT8)input->data) + input->sizeBuf))
{
/* Indicate one more character in this line
*/
input->charPositionInLine++;
if ((ANTLR3_UCHAR)(*((pANTLR3_UINT8)input->nextChar)) == input->newlineChar)
{
/* Reset for start of a new line of input
*/
input->line++;
input->charPositionInLine = 0;
input->currentLine = (void *)(((pANTLR3_UINT8)input->nextChar) + 1);
}
/* Increment to next character position
*/
input->nextChar = (void *)(((pANTLR3_UINT8)input->nextChar) + 1);
}
}
After an investigation, I have figured out that on a file with size 1.4KB, this function, which should be called at most one per bytes I think, it's called 24.5M times. Do you know if this is a known issue or if there is another explanation for this weird behaviour?
EDITED
After some trials, I have figured out the rules which cause this issue. I have some rules in order to recognize specific object identifiers, which are very simple:
OID1 : {counter==6}?=> 2 3 4 5 840 {counter=0;}
and a rule which matches every byte in the content field of ASN1 encoding:
VALUE : ({counter>0}?=> '\u0000'..'\u00FF' {counter--;})+
Since the lexer uses Longest Matching rule, OID1 will never be matched, because VALUE can match an arbitrary long value. Hence, in order to fool the lexer, I edit OID1 rule in this way:
OID1 : {counter==6}?=> 2 3 4 5 840 {counter=0;} VALUE?
The VALUE token at the end of the rule will never be matched, since that rule is active only when counter is greater than 0, but in this way the lexer will consider also OID1 in the matching, because it can be longer than VALUE.
However, this rule caused that huge number of calls to BitConsume function! As soon as I delete the VALUE? part, I got a number of calls precisely equal to the number of bytes of input file.
I think that this is an implementation bug of ANTLR, since it should try to match the input following OID1 with VALUE, but it should immediately stop and skip the token because counter is 0.

Parsing script loops in C#

I'm writing an application that will parse a script in a custom language (based slightly on C syntax and Allman style formatting) and am looking for a better (read: faster) way of parsing blocks of the script code into string arrays than the way I'm currently doing it (the current method will do, but it was more for debug than anything else).
The script contents are currently read from a file into a string array and passed to the method.
Here's a script block template:
loop [/* some conditional */ ]
{
/* a whole bunch of commands that are to be read into
* a List<string>, then converted to a string[] and
* passed to the next step for execution */
/* some command that has a bracket delimited set of
* properties or attributes */
{
/* some more commands to be acted on */
}
}
Basically, the curly bracket blocks can be nested (just like in any other C-based language), and I'm looking for the best way to find individual blocks like this.
The curly bracket delimited blocks will ALWAYS be formatted like this - the contents of the brackets will start on the line after the open bracket and will be followed by a bracket on the line after the final attribute/command/comment/whatever.
An example might be:
loop [ someVar <= 10 ]
{
informUser "Get ready to do something"
readValue
{
valueToLookFor = 0x54
timeout = 10 /* in seconds */
}
}
This would tell the app to loop whilst someVar is less than 10 (sorry for the sucking eggs comment). Each time round, we pass a message to the user and look for a specific value from somewhere (with a timeout of 10 seconds).
Here's how I'm doing it at the minute (note: the method that calls this passes the entire string[] containing the current script into it with an index to read from):
private string[] findEntireBlock(string[] scriptContents, int indexToReadFrom,
out int newIndex)
{
newIndex = 0;
int openBraceCount = 0; // for '{' char count
int closeBraceCount = 0; // for '}' char count
int openSquareCount = 0; // for '[' char count
int closeSquareCount = 0; // for ']' char count
List<string> fullblock = new List<string>();
for (int i = indexToReadFrom; i < scriptContents.Length; i++)
{
if (scriptContents[i].Contains('}'))
{
if (scriptContents[i].Contains("[") && fullblock.Count > 0)
{
//throw new exception, as we shouldn't expect to
//to find a line which starts with [ when we've already
}
else
{
if (scriptContents[i].Contains('{')) openBraceCount++;
if (scriptContents[i].Contains('}')) closeBraceCount++;
if (scriptContents[i].Contains('[')) openSquareCount++;
if (scriptContents[i].Contains(']')) closeBraceCount++;
newIndex = i;
fullblock.Add(scriptContents[i]);
break;
}
}
else
{
if (scriptContents[i].Contains("[") && fullblock.Count > 0)
{
//throw new exception, as we shouldn't expect to
//to find a line which starts with [ when we've already
}
else
{
if (scriptContents[i].Contains('{')) openBraceCount++;
if (scriptContents[i].Contains('}')) closeBraceCount++;
if (scriptContents[i].Contains('[')) openSquareCount++;
if (scriptContents[i].Contains(']')) closeBraceCount++;
fullblock.Add(scriptContents[i]);
}
}
}
if (openBraceCount == closeBraceCount &&
openSquareCount == closeSquareCount)
return fullblock.ToArray();
else
//throw new exception, the number of open brackets doesn't match
//the number of close brackets
}
I agree that this might be a slightly obtuse and slow method to follow, that's why I'm asking for any ideas on how to re-implement this for speed and clarity (if a balance can be met, that is).
I'm looking to stay away from RegEx, because I can't use it to maintain a bracket count and I'm not sure on whether you can write a RegEx statement (is that the correct term?) that can act recursively. I was thinking of working from the inside outward, but I'm convinced that would be quite slow.
I'm not looking for someone to re-write it for me, but a general idea on algorithms or techniques/libraries that I could use that would improve my method.
As a side question, how do compilers deal with multiple nested brackets in source code?
Let's Build a Compiler, by Jack Crenshaw, is a fantastic, easy-to-read introduction to building a basic compiler. The techniques discussed should help with what you're trying to do here.

What is an algorithm/structure that can be used to effectively find if a object matches any of a set of patterns?

A pattern is a hash with values and functions. For example:
pattern = {a:1,b:2,c:function(x){ return x<5; }}
There is a function that checks if an object matches a pattern. For example, an object will match the pattern above if obj.a == 1, obj.b == 2 and obj.c < 5. Some tests:
matches(pattern,{a:1,b:3,c:2}) == false // because b != 2
matches(pattern,{a:1,b:2,c:7}) == false // because c >= 5
matches(pattern,{a:1,b:2,c:3}) == true //fine
matches(pattern,{a:1,b:2,c:2,d:4}) == true //no problems in having extras
Suppose that I have a set of patterns and I want to find if an object matches any of those patterns. I could check one by one, but, this way, I have an O(n) complexity, where n is the number of patterns. I have a feeling that this can be optimized if I use the set of patterns to build some other structure; but I'm not sure what that structure could be. Thoughts?
You can create a decision tree (Or optimize it to BDD data structure). This requires reading each relevant variable only once during evaluation of each object.
A BDD is a way to evaluate a logical formula, in your case the logical formula is
pattern_1 OR pattern_2 OR pattern_3 OR .... OR pattern_n

Resources