Changing tokenisation behaviour - stanford-nlp

I'm using the Stanford Tokenizer in my project, and I can't understand or fix how it tokenizes one specific pattern.
With my configuration, if I tokenize the string:
"Hello World>"
I will correctly get:
hello
world
>
But for string:
"<Hello World>"
I'm getting:
<hello world>
And I expect to receive:
<
hello
world
>
Is there a way to configure the tokenizer so that it does not treat this specific pattern as a single token?
These are my current options for the tokenizer:
-lowerCase -options "untokenizable=allKeep,americanize=true,normalizeOtherBrackets=false,normalizeParentheses=false"
Help is much appreciated.
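For reference, here is a minimal sketch of how those options can be passed to Stanford CoreNLP's PTBTokenizer from Java, assuming that is the tokenizer class behind your setup (the -lowerCase flag is a separate command-line switch and is left out of the options string):
import java.io.StringReader;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        // Options copied from the question above.
        String options = "untokenizable=allKeep,americanize=true,"
                + "normalizeOtherBrackets=false,normalizeParentheses=false";
        PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
                new StringReader("<Hello World>"), new CoreLabelTokenFactory(), options);
        while (tokenizer.hasNext()) {
            System.out.println(tokenizer.next().word()); // one token per line
        }
    }
}
This mirrors the command-line setup described above.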

Related

How to escape a double-quote sign inside a (Tau) Prolog string?

I guess there must be an easy answer to this, but I just haven't been able to find it.
I want to include double-quote signs inside Tau-Prolog strings. How do I do it?
When I try entering it into the Tau-Prolog sandbox (http://tau-prolog.org/sandbox/) I get the following interaction:
A = "hello there, \"world\"".
error parsing query: error(syntax_error(unknown_escape_sequence),[line(1),column(4),found('"hello there, \\"')])
Any ideas?
Where did I (stupidly) misunderstand things? :)
PS - it's not only a problem in the sandbox. If I try to run a Tau-Prolog program (in the browser) and use a double-quote character anywhere inside a string, I get the same parsing problem.

JISON: How do I avoid "dog" being parsed as "do"?

I have the following JISON file (lite version of my actual file, but reproduces my problem):
%lex
%%
"do" return 'DO';
[a-zA-Z_][a-zA-Z0-9_]* return 'ID';
"::" return 'DOUBLECOLON'
<<EOF>> return 'ENDOFFILE';
/lex
%%
start
: ID DOUBLECOLON ID ENDOFFILE
{$$ = {type: "enumval", enum: $1, val: $3}}
;
It is for parsing something like "AnimalTypes::cat". It works fine for things like "AnimalTypes::cat", but when it sees dog instead of cat, it assumes it's a DO instead of an ID. I can see why it does that, but how do I get around it? I've been looking at other JISON documents, but can't seem to spot the difference that (I assume) makes those work.
This is the error I get:
JisonParserError: Parse error on line 1:
PetTypes::dog
----------^
Expecting "ID", "enumstr", "id", got unexpected "DO"
Repro steps:
Install jison-gho globally from npm (or modify the code to use a local version). I use Node v14.6.0.
Save the JISON above as minimal-repro.jison
Run jison -m es -o ./minimal.mjs ./minimal-repro.jison to create the parser
Create a file named test.mjs with code like:
import Parser from "./minimal.mjs";
Parser.parser.parse("PetTypes::dog")
Run node test.mjs
Edit: Updated with a reproducible example.
Edit2: Simpler JISON
Unlike (f)lex, the jison lexer accepts the first matching pattern, even if it is not the longest matching pattern. You can get the (f)lex behaviour by using
%option flex
However, that significantly slows down the scanner.
The original jison automatically added \b to the end of patterns which ended with a literal string matching an alphabetic character, to make it easier to match keywords without incurring this overhead. In jison-gho, this feature was turned off unless you specify
%option easy_keyword_rules
See https://github.com/zaach/jison/wiki/Deviations-From-Flex-Bison#user-content-literal-tokens.
So either of those options will achieve the behaviour you expect.
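If you would rather not switch on a lexer option, the same idea can be applied by hand, mirroring what easy_keyword_rules automates: append \b to the keyword pattern so that "do" only matches as a whole word. A minimal, untested sketch (assuming your jison version accepts \b in lexer patterns, which compile down to JavaScript regexes):
%lex
%%
"do"\b                  return 'DO';          /* only when "do" is a whole word, so "dog" falls through to ID */
[a-zA-Z_][a-zA-Z0-9_]*  return 'ID';
"::"                    return 'DOUBLECOLON';
<<EOF>>                 return 'ENDOFFILE';
/lex
%%
start
    : ID DOUBLECOLON ID ENDOFFILE
        {$$ = {type: "enumval", enum: $1, val: $3}}
    ;
With this change, "PetTypes::dog" lexes as ID DOUBLECOLON ID, because the DO rule no longer matches the first two characters of dog.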

Here document gives EOF error in Ruby IO

The following code gives two errors which I am not able to resolve. Any help would be appreciated:
random.rb:10: can't find string "TEMPLATE" anywhere before EOF
random.rb:3: syntax error, unexpected end-of-input
Code:
id = 2
File.open("#{id}.json","w") do |file|
  file.write <<TEMPLATE
    {
      "submitter":"#{hash["submitter"]}",
      "quote":"#{hash["quote"]}",
      "attribution":"#{hash["attribution"]}"
    }
  TEMPLATE
end
From the documentation (emphasis mine):
The heredoc starts on the line following <<HEREDOC and ends with the next line that starts with HEREDOC
Your code doesn't contain a line starting with TEMPLATE. If your text editor (or IDE) supports regular expressions in searches, try ^TEMPLATE.
You can either remove the spaces or, if you want to keep them, change <<TEMPLATE into <<-TEMPLATE. The addition of - instructs the Ruby parser to search for a (possibly) indented TEMPLATE like the one you have in your code.
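For illustration, a minimal sketch of the <<-TEMPLATE variant; the hash below is sample data, since the original one isn't shown in the question. The - lets the closing TEMPLATE keep its indentation inside the block:
id = 2
hash = { "submitter" => "someone", "quote" => "a quote", "attribution" => "a name" } # sample data, not from the question
File.open("#{id}.json", "w") do |file|
  file.write <<-TEMPLATE
  {
    "submitter":"#{hash["submitter"]}",
    "quote":"#{hash["quote"]}",
    "attribution":"#{hash["attribution"]}"
  }
  TEMPLATE
end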

"Unrecognized character \xE2" in a Hello World program

I am trying to write my first Perl "hello world" program on Mac OS X Yosemite, and it shows this error when I try to run it from the terminal:
Unrecognized character \xE2; marked by <-- HERE after
print <-- HERE
near column 7 at test.pl line 4.
I couldn't figure out what was wrong in this program. Please help me out here.
Code:
#!/usr/bin/perl
use strict;
use warnings;
print “Hello world”;
Change the curly “ ” quotes in the print statement to straight " quotes.
Example
print "Hello world";
Make sure quote characters like ' are the plain ASCII ones throughout. You can check your Perl file for syntax errors with:
perl -c testfile.pl
While it is not directly connected to this case, there is another, less obvious situation in which the \xE2 error can appear: a zero-width space in the string will also trigger it.
I couldn't see this character in Notepad or Notepad++, but I could see it in Vim as <200b>. It can end up next to { and } characters when copying text from, for example, Microsoft Teams.
This link appears as the first one when searching for this kind of problem, so I thought it might be a good idea to post the solution here.
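If you need to track down where such a character is hiding, one possible approach (just a sketch; any editor that shows non-ASCII characters works too) is a Perl one-liner that prints every line of the script containing a byte outside the ASCII range, along with its line number:
perl -ne 'print "$.: $_" if /[^\x00-\x7F]/' test.pl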

Ruby regex and special characters like dash (—) and »

I'm trying to replace all punctuation and the like in some text with just a space. So I have the line
text = "—Bonne chance Harry murmura t il »"
How can I remove the dash and the »? I tried
text.gsub( /»|—/, ' ')
which gives an error, not surprisingly. I'm new to Ruby and just trying to get the hang of things by writing a script to pull all the words out of a chapter of a book. I figure I'd just remove the punctuation and symbols and then use text.split. Any help would be appreciated. I couldn't find much on this.
It turns out the problem had to do with the UTF-8 encoding. Adding
# encoding: utf-8
at the top of the file solved my issues, and what @Andrewlton said works great.
This should properly substitute in the way you were trying to do it; just add brackets and remove the pipe:
text.gsub(/[»—]/, ' ')
The standard punctuation regexp also works:
text.gsub(/\p{P}/, ' ')
You should be able to use regexp pretty universally, coming from whatever language you know. Hope this helps!
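Putting this together with the text.split idea from the question, a small sketch (the encoding comment is only needed on older Rubies whose default source encoding is not UTF-8):
# encoding: utf-8
text = "—Bonne chance Harry murmura t il »"
words = text.gsub(/\p{P}/, ' ').split
# => ["Bonne", "chance", "Harry", "murmura", "t", "il"]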
