We are using Stanford NER to train our own (CRF) classifier for French newspaper texts.
We are having problems with punctuation; in particular, Stanford NER seems to replace some punctuation marks with others.
Here is an example where the ' in "aujourd'hui" is replaced by ` and the « and » that enclose Ave Maria are replaced by `` and ''.
Input raw text:
" Aujourd'hui ... « Ave Maria » et ..."
Stanford NER output:
word | tag | begin-offset | end-offset
Aujourd | O | 31 | 38
` | O | 38 | 39
hui | O | 39 | 42
`` | O | 331 | 332
Ave | O | 333 | 336
Maria | O | 337 | 342
'' | O | 343 | 344
We have tested the following flags when creating the classifier:
-outputFormatOptions includePunctuationDependencies
-inputEncoding utf-8
-outputEncoding utf-8
but none of them worked.
I would appreciate any help.
Here is an example command tokenizing French text with the French tokenizer:
java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-french.properties -file example-french-sentence-one.txt -outputFormat text
Note the tokenize property:
tokenize.language = fr
This will tell the tokenizer to use the French tokenizer.
That should handle the case of Aujourd'hui, but unfortunately the guillemets are hard-coded to be converted to " in the French lexer, and no option changes that behavior.
If I get a chance I'll try to push a change to the French tokenizer that sets that behavior as optional.
If you have another method to tokenize your text before submitting it to Stanford CoreNLP, you can provide already-tokenized text to the pipeline with the option tokenize.whitespace, supplying the tokens separated by whitespace. Otherwise, another option is to process your training data so that it matches the way Stanford CoreNLP will tokenize it.
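For example, a pipeline call along these lines accepts pre-tokenized, whitespace-separated input (the input file name here is just an illustration):
java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-french.properties -tokenize.whitespace true -file pre-tokenized-sentence.txt -outputFormat text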
I have the following table, which contains data for millions of documents in the form of a JSON file:
+-------+---------------------------------------+------------+
| doc_id| doc_text | doc_lang |
+-------+---------------------------------------+------------+
| doc1 | "first /resource X 'title' " | en |
| doc2 | "<r>ressource 2 #titre en France" | Fr |
| doc3 | "die Tür geöffnet?" | ge |
| doc4 | "$risorsa 4 <in> lingua italiana" | It |
| ... | " ........." | .. |
| ... | "........." | .. |
+-------+---------------------------------------+------------+
I need to do the following:
Tokenization, filtering, and stopword removal for each document's text, using an appropriate analyzer chosen dynamically according to the language shown in the doc_lang field (let's say European languages).
Getting TF and IDF for each term inside the doc_text field (no search operations are required, just scoring).
Q) Could anybody advise me on whether Elasticsearch is a good choice in this case?
P.S. I am looking for something compatible with Apache Spark.
Include the language code in the doc_text field name when indexing, like:
{ "doc_id": "doc", "doc_text_en": "xxx", "doc_lang": "en"}
Then you will be able to specify a dynamic mapping that applies a language-specific analyzer.
https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-dynamic-mapping.html
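A minimal sketch of such a dynamic template, assuming a recent Elasticsearch version and suffix-named fields as above (the index name docs is illustrative):
PUT /docs
{
  "mappings": {
    "dynamic_templates": [
      { "english_text": {
          "match":   "*_en",
          "mapping": { "type": "text", "analyzer": "english" }
      }},
      { "french_text": {
          "match":   "*_fr",
          "mapping": { "type": "text", "analyzer": "french" }
      }}
    ]
  }
}
With this in place, any field ending in _en is analyzed with the built-in english analyzer at index time, which covers the per-language tokenizing, filtering, and stopword removal.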
I'm currently trying to test a service that should be replacing certain special characters within a certain Unicode range, including emojis, transportation icons, emoticons, and dingbats. I have been using Cucumber and Ruby to do the testing, and my latest scenario outline won't work. I've tried other ways of getting the character from the examples table, but I can't seem to get it working, and the Cucumber printout just complains that the Given step isn't defined.
Here is my feature scenario:
Scenario Outline: I update a coupon with a name including emojis/emoticons/dingbats/symbols
Given I have a name variable with a <character> included
When I patch my coupon with this variable
Then the patch should succeed
And the name should include replacement characters
Examples:
| character |
| 🐳 |
| 🍅 |
| 😥 |
| 🚀 |
| ☀ |
| ☂ |
| ✋ |
| ✂ |
| ⨕ |
| 𐊀 |
| 𦿰 |
And here is my step definition for the Given (which is the step Cucumber complains isn't defined):
Given(/^I have a name variable with a (\w+) included$/) do |char|
  @name = 'min length ' + char
  @json = { 'name' => @name }.to_json
end
I've tried using some regexes to capture the character, including (\w+) and (\d+), but I can't find information on how to capture the special character. It's possible for me to write 11 different step definitions, but that would be such poor practice it would drive me nuts.
Unless you have spaces in your special characters, it's safe to use the non-space class \S:
Given(/^I have a name variable with a (\S+) included$/) do |char|
...
\w would not give you the desired result, since \w is resolved to [a-zA-Z0-9_].
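For illustration, a quick irb check with one of the emojis from the table (assuming a UTF-8 source encoding):
"🐳" =~ /\w+/  #=> nil, an emoji is not in [a-zA-Z0-9_]
"🐳" =~ /\S+/  #=> 0, matched at offset 0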
I am trying to write a Boolean expression grammar that can treat whitespace as an implicit logical AND, e.g., "A B" means "A AND B".
However, I would also like to treat a US-formatted phone number as a single token, e.g., (123) 456-7890. My grammar covers most cases, but still has an ambiguity around AREA_CODE.
Here is my grammar:
grammar myBooleanExpr;
options
{
language = Java;
output = AST;
}
tokens {
AND;
}
fragment DIGIT : '0'..'9';
fragment AREA_CODE : LPAREN DIGIT+ RPAREN;
fragment NUMBER : ( DIGIT | '-' )+;
LPAREN : '(' ;
RPAREN : ')' ;
WS : ( ' ' | '\t' | '\r' | '\n' )+ { $channel = HIDDEN; };
L_AND: 'AND'| 'And' | 'and';
OR : 'OR' | 'Or' | 'or';
NOT : 'NOT' | 'Not' | 'not';
NAME : (~( ' ' | '\t' | '\r' | '\n' | '(' | ')' | '"') )*;
PHONE : AREA_CODE ' '? NUMBER?;
QUOTED_NAME : '"'.*'"';
expression : orexpression;
orexpression : andexpression (OR^ andexpression)*;
andexpression : notexpression (L_AND? notexpression)* -> ^(AND notexpression+);
notexpression : NOT^ atom | atom;
atom : NAME | PHONE | QUOTED_NAME | LPAREN! orexpression RPAREN!;
Input vs. Expected Output:
(123) 456-7890   ->  (123) 456-7890        // single token
(123) abc        ->  123 AND abc           // two tokens
(123456) 789     ->  123456 AND 789        // two tokens ### currently failed
(12 34)          ->  12 AND 34             // two tokens ### currently failed
(123) 456-aaaa   ->  123 AND 456-aaaa      // two tokens ### currently failed
abc efg AND hij  ->  abc AND efg AND hij   // three tokens
It is very difficult for me to understand the usage of input.LA(1) and the like. I would very much appreciate it if someone could jump in and help me with this issue.
I think you are trying to put too much into lexer rules. Parsing telephone numbers like that needs more flexibility; e.g., a single space character might not be enough, and what about tabs? Instead, you should lex all individual tokens (numbers, punctuation, etc.) as usual and do a semantic check once you have the syntax tree from the parser run.
It's up to you to decide whether the space between two tokens is just that or can be interpreted as a logical operation (here AND). Neither the parser nor the lexer can know that; it depends on the context. This is why you cannot make that grammar free of ambiguities.
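To make the post-lexing idea concrete, here is a minimal Ruby sketch (deliberately outside ANTLR, with a simplified phone shape) of merging adjacent tokens into one phone token after tokenization; anything that does not match stays separate and falls through to the implicit-AND logic:
# Hypothetical post-lex pass: join "(ddd)" followed by "ddd-dddd"
# into a single phone token instead of deciding inside the lexer.
AREA  = /\A\(\d{3}\)\z/
LOCAL = /\A\d{3}-\d{4}\z/

def merge_phone_tokens(tokens)
  merged = []
  i = 0
  while i < tokens.length
    if tokens[i] =~ AREA && tokens[i + 1] =~ LOCAL
      merged << "#{tokens[i]} #{tokens[i + 1]}"  # one PHONE token
      i += 2
    else
      merged << tokens[i]
      i += 1
    end
  end
  merged
end

merge_phone_tokens(["(123)", "456-7890"])  #=> ["(123) 456-7890"]
merge_phone_tokens(["(123456)", "789"])    #=> ["(123456)", "789"]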
I'm trying to use a regular expression to solve a Reverse Polish notation calculator problem, but I'm having issues with converting the mathematical expressions into conventional infix form.
I wrote:
puts '35 29 1 - 5 + *'.gsub(/(\d*) (\d*) (\W)/, '(\1\3\2)')
which prints:
35 (29-1)(+5) *
but I expected:
(35*((29-1)+5))
What am I doing wrong?
I'm assuming you meant you tried
puts '35 29 1 - 5 + *'.gsub(/(\d*) (\d*) (\W)/, '(\1\3\2)')
                                ^     ^
Anyway, you have to use the quantifier + instead of *, since otherwise you will match an empty string for \d* as one of your captures, hence the (+5):
/(\d+) (\d+) (\W)/
I would further extend/constrain the expression to something like:
/([\d+*\/()-]+)\s+([\d+*\/()-]+)\s+([+*\/-])/
 |             |  |             |  |
 |             |  |             |  Valid operators, +, -, *, and /.
 |             |  |             |
 |             |  |             Whitespace.
 |             |  |
 |             |  Arbitrary atom, e.g. "35", "(29-1)", "((29-1)+5)".
 |             |
 |             Whitespace.
 |
 Arbitrary atom, e.g. "35", "(29-1)", "((29-1)+5)".
...and instead of using gsub, use sub in a while loop that quits when it detects that no more substitutions can be made. This is very important because otherwise, you will violate the order of operations. For example, take a look at this Rubular demo. You can see that by using gsub, you might potentially replace the second triad of atoms, "5 + *", when really a second iteration should substitute an "earlier" triad after substituting the first triad!
WARNING: The - (minus) character must appear first or last in a character class; otherwise it will specify a range! (Thanks to @JoshuaCheek.)
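Put together, a minimal sketch of that sub-in-a-loop approach (the variable names are mine):
expr  = '35 29 1 - 5 + *'
triad = /([\d+*\/()-]+)\s+([\d+*\/()-]+)\s+([+*\/-])/
loop do
  substituted = expr.sub(triad, '(\1\3\2)')
  break if substituted == expr  # no triad left to rewrite
  expr = substituted
end
puts expr  #=> (35*((29-1)+5))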
Basically the question says it all: how can I convert an XML file to YAML?
I've tried this:
require 'active_support/core_ext/hash/conversions'
require 'yaml'
file = File.open("data/mconvert.xml", "r")
hash = Hash.from_xml(file.read)
yaml = hash.to_yaml
File.open("data/mirador.yml", "w") { |file| file.write(yaml) }
But I am getting an "Exception parsing" error. I thought that was because I had dashes in an XML tag name, so I replaced the dashes with "dashcharacter", but that still didn't work.
If we have a look at the XML 1.0 specification, we'll see that start tags look like this:
[40] STag ::= '<' Name (S Attribute)* S? '>'
and then elsewhere, we find the definition of Name:
[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
You'll notice that - is not in NameStartChar so this:
<-vikings->1336162202</-vikings->
is not valid XML and this part of your code:
hash = Hash.from_xml(file.read)
is failing because your file doesn't contain XML, it contains text that looks like XML but isn't quite real XML.
Fix your data/mconvert.xml file to contain real XML and try again.
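For instance, a deliberately narrow cleanup sketch that rewrites only the offending tag shown above into a valid name before parsing (assuming that is the sole invalid tag in the file):
require 'active_support/core_ext/hash/conversions'
require 'yaml'

raw   = File.read('data/mconvert.xml')
fixed = raw.gsub('<-vikings->', '<vikings>').gsub('</-vikings->', '</vikings>')
File.write('data/mirador.yml', Hash.from_xml(fixed).to_yaml)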
If you try a simple experiment in the Rails console, you'll see what's going on:
> Hash.from_xml('<-vikings->1336162202</-vikings->')
REXML::ParseException: #<REXML::ParseException: malformed XML: missing tag start
Line: 1
Position: 33
Last 80 unconsumed characters:
<-vikings->1336162202</-vikings->>
Notice the "malformed XML: missing tag start"?