stanfordnlp differences 3.9.2 -> 4.0.0 changelog - stanford-nlp

Changed to UDv2 tokenization (“new” LDC Treebank, for English); handles multi-word tokens; improved UDv2-based taggers and parsers for English, French, German, Spanish; new French NER; new Chinese segmenter; library updates, bug fixes
https://stanfordnlp.github.io/CoreNLP/history.html
Big changes in CoreNLP 4: no LDC-style escaping of tokens (no more -LRB-), change English tokenization to that of “new” LDC treebanks and UD (mainly split hyphens), use UDv2 dependencies and POS for English, French, German, and Spanish (have obj and obl)
https://twitter.com/stanfordnlp/status/1252657192764796928
Can someone provide a detailed/informative/usable summary of the changes at the interface and at the functional level? If it helps narrow the scope:
English models
Annotation requires()
CoreAnnotations.TextAnnotation.class
CoreAnnotations.TokensAnnotation.class
CoreAnnotations.SentencesAnnotation.class
CoreAnnotations.PartOfSpeechAnnotation.class
CoreAnnotations.LemmaAnnotation.class
CoreAnnotations.NamedEntityTagAnnotation.class
CoreAnnotations.NormalizedNamedEntityTagAnnotation.class
CoreAnnotations.CanonicalEntityMentionIndexAnnotation.class
CorefCoreAnnotations.CorefChainAnnotation.class
SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class
CoreAnnotations.QuotationsAnnotation.class
Pipeline properties
props.setProperty("coref.algorithm", "statistical");
props.setProperty("coref.maxMentionDistance", "15");
props.setProperty("coref.maxMentionDistanceWithStringMatch", "50");
props.setProperty("coref.statisical.pairwiseScoreThresholds", ".15");
props.setProperty("pos.maxlen", "70");
props.setProperty("ner.maxlen", "70");
props.setProperty("ner.applyFineGrained", "false");
props.setProperty("ner.useSUTime", "true");
props.setProperty("ner.applyNumericClassifiers", "true");
props.setProperty("ner.combinationMode", "NORMAL");
props.setProperty("ner.additional.regexner.mapping", "regexner.txt");
props.setProperty("quote.maxLength", "70");
props.setProperty("quote.singleQuotes", "true");
props.setProperty("quote.asciiQuotes", "true");
props.setProperty("quote.attributeQuotes", "true");
props.setProperty("enforceRequirements", "true");
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
props.setProperty("parse.maxlen", "70");

First off, we could not honestly say that CoreNLP has strictly used semantic versioning. Call what it uses "impressionistic versioning", if you will.
That said, we deliberately decided to move to v4 to reflect the fact that the upgrade will break downstream code for most users.
However, things are subtle with NLP software. There are no big changes in the APIs, and while there are some changes in Properties, such as for the tokenizer, most properties have not changed either; all the ones you list above look fine. Indeed, if your downstream pipeline is details-agnostic, such as an all-ML pipeline, CoreNLP 4.0 probably won't break anything and you can just plug it in.
However, most people will have made assumptions about the details of tokenization or the taxonomies of parts of speech, dependencies, etc., and then code changes will definitely be required.
E.g., if your existing code assumes details of English tokenization, such as parentheses being rendered as '-LRB-', '-RRB-' or quotes being turned into ASCII sequences like '``', then your code will need changes. If you assume that your Spanish tagger/parser will provide a "simplified AnCora" part-of-speech tag set rather than a UD tag set, then your code will need changes. If you assume that in dependency analyses the dependency 'dobj' will mark direct objects (UDv1), then you will need to update some patterns or whatever to look for 'obj' instead (UDv2). Etc.
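To make that concrete, here is a minimal sketch of the kind of downstream check that needs updating, assuming the 4.0.0 English models are on the classpath; the sentence and annotator list are just illustrations:
import java.util.Properties;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;

public class V4MigrationCheck {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("The cat (a tabby) chased the mouse.");
    pipeline.annotate(doc);

    for (CoreSentence sentence : doc.sentences()) {
      // 4.0.0: parentheses come back as "(" / ")", no longer "-LRB-" / "-RRB-"
      for (CoreLabel tok : sentence.tokens()) {
        if (tok.word().equals("(")) {          // was: tok.word().equals("-LRB-")
          System.out.println("open paren at token " + tok.index());
        }
      }
      // 4.0.0: direct objects are labelled "obj" (UDv2), no longer "dobj" (UDv1)
      SemanticGraph sg = sentence.coreMap().get(
          SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
      for (SemanticGraphEdge edge : sg.edgeIterable()) {
        if (edge.getRelation().getShortName().equals("obj")) {   // was: "dobj"
          System.out.println("object: " + edge.getDependent().word());
        }
      }
    }
  }
}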
There is a more detailed set of release notes here: https://github.com/stanfordnlp/CoreNLP/releases/tag/v4.0.0 . But there isn't any precise record of all changes. Except the GitHub history....

Related

Clarify steps to add a language variant to Stanza

I would like to add a non-standard variant of a language already supported by Stanza. It should be named differently from the standard variety included in the common distribution of Stanza. I could use a modified version of the corpus for training, since the changes are mostly morphological rather than syntactic, but how many steps would I need to take to make a new language variety for Stanza from this background? From the web documentation, I don't understand which data are inputs and which are outputs in the process of adding a new language.
It sounds like you are trying to add a different set of processors rather than a whole new language. The difference being that other steps of the pipeline will still work the same, right? NER models, for example.
If that's the case, if you can follow the steps to retrain the current models, you should be able to then replace the input data with your morphological updates.
I suggest filing an issue on GitHub if you encounter difficulties in the process. It will be a lot easier to go back and forth there.
Times when we would actually recommend a whole new language are when 1) it's actually a new language or 2) it uses a different character set - think of the different writing systems for ZH, or for Punjabi if we had any Punjabi models.

Stanford NLP core 4.0.0 no longer splitting verbs and pronouns in Spanish

Very helpfully, Stanford CoreNLP 3.9.2 used to split rolled-together Spanish verbs and pronouns.
This is the 4.0.0 output:
The previous version had more .tagger files. These have not been included with the 4.0.0 distribution.
Is that the cause? Will they be added back?
There are some documentation updates that still need to be made for Stanford CoreNLP 4.0.0.
A major change is that a new multi-word-token annotator has been added, that makes tokenization conform with the UD standard. So the new default Spanish pipeline should run tokenize,ssplit,mwt,pos,depparse,ner. It may not be possible to run such a pipeline from the server demo at this time, as some modifications will need to be made. I can try to send you what such modifications would be soon. We will try to make a new release in early summer to handle issues like this that we missed.
Unfortunately it won't split the word in your example, but I think in many cases it will do the correct thing. The Spanish mwt model is just based on a large dictionary of terms, and was tuned to optimize performance on the Spanish training data.
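For reference, a rough sketch of running the new default Spanish pipeline with the mwt annotator from Java, assuming the Spanish models jar is on the classpath (it bundles StanfordCoreNLP-spanish.properties with the default model paths); the example sentence is arbitrary:
import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class SpanishMwtDemo {
  public static void main(String[] args) throws Exception {
    // Load the bundled Spanish defaults (model paths etc.) from the models jar.
    Properties props = new Properties();
    props.load(SpanishMwtDemo.class.getClassLoader()
        .getResourceAsStream("StanfordCoreNLP-spanish.properties"));
    // The new 4.0.0 default Spanish pipeline: mwt runs right after ssplit.
    props.setProperty("annotators", "tokenize,ssplit,mwt,pos,depparse,ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("Fui al mercado.");
    pipeline.annotate(doc);
    // With mwt, contractions such as "al" may come back as two syntactic
    // words ("a" + "el"), which is what the UD standard expects.
    doc.tokens().forEach(tok -> System.out.println(tok.word()));
  }
}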

NLP Postagger can't grok imperatives?

The Stanford NLP POS tagger documentation claims imperative verbs were added in a recent version. I've input lots of text with abundant and obvious imperatives, but there seems to be no tag for them in the output. Must one, after all, train it for this POS?
There is no special tag for imperatives; they are simply tagged as VB.
The info on the website refers to the fact that we added a bunch of manually annotated imperative sentences to our training data such that the POS tagger gets more of them right, i.e. tags the verb as VB.
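A quick way to see this is sketched below with the plain MaxentTagger API; the model path is the English left3words tagger from the models jar and may differ slightly between releases:
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class ImperativeTagDemo {
  public static void main(String[] args) {
    // Path of the default English tagger in the models jar (may vary by version).
    MaxentTagger tagger = new MaxentTagger(
        "edu/stanford/nlp/models/pos-tagger/english-left3words-distsim.tagger");
    // Imperatives get no dedicated tag; the verb should come out as VB,
    // e.g. something like "Close_VB the_DT door_NN ._."
    System.out.println(tagger.tagString("Close the door."));
  }
}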

Why do people use plain English as translation placeholders?

This may be a stupid question, but here goes.
I've seen several projects using some translation library (e.g. gettext) working with plain English placeholders. So for example:
_("Please enter your name");
instead of abstract placeholders (which has always been my instinctive preference)
_("error_please_enter_name");
I have seen various recommendations on SO to work with the former method, but I don't understand why. What I don't get is: what do you do if you need to change the English wording? Because if the actual text is used as the key for all existing translations, you would have to edit all the translations, too, and change each key. Or don't you?
Isn't that awfully cumbersome? Why is this the industry standard?
It's definitely not proper normalization to do it this way. Are there massive advantages to this method that I'm not seeing?
Yes, you have to alter the existing translation files, and that is a good thing.
If you change the English wording, the translations probably need to change, too. Even if they don't, you need someone who speaks the other language to check.
You prep a new version, and part of the QA process is checking the translations. If the English wording changed and nobody checked the translation, it'll stick out like a sore thumb and it'll get fixed.
The main language already exists: you don't need to translate it.
Translators have better context with a real sentence than vague placeholders.
The placeholders are just the keys; it's still possible to change the original language by creating a translation for it, because when a translation doesn't exist, the placeholder is used as the translated text.
We've been using abstract placeholders for a while and it was pretty annoying having to write everything twice when creating a new function. When English is the placeholder, you just write the code in English, you have meaningful output from the start and don't have to think about naming placeholders.
So my reason would be less work for the developers.
I like your second approach. When translating texts you always have the problem of homonyms. 'Open' can mean the state of a window but also the verb to perform the action. In other languages these homonyms may not exist. That's why you should be able to add meaning to your placeholders. The best approach is to put this meaning in your text library. If this is not possible on the platform or framework you use, it might be a good idea to define a 'development language'. This language will add meaning to the text entries, like 'action_open' and 'state_open'. You will of course have to put extra effort into translating this language to plain English (or whatever language you develop for). I have applied this philosophy in some large projects, and in the long run it saves some time (and headaches).
The best way in my opinion is keeping meaning separate so if you develop your own translation library or the one you use supports it you can do something like this:
_(i18n("Please enter your name", "error_please_enter_name"));
Where:
i18n(text, meaning)
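A hypothetical sketch of that idea (class, map, and keys are made up for illustration; a real library would read its table from translation files): the lookup is keyed by text plus meaning and falls back to the plain English text when no translation exists.
import java.util.Map;

public final class I18n {
  // Made-up in-memory table; a real library would load this from its
  // translation files. The key combines the meaning and the source text.
  private static final Map<String, String> TRANSLATIONS = Map.of(
      "state_open|Open",  "offen",    // a window that is open
      "action_open|Open", "Öffnen");  // a menu entry that opens something

  public static String i18n(String text, String meaning) {
    // Fall back to the English source text if there is no translation.
    return TRANSLATIONS.getOrDefault(meaning + "|" + text, text);
  }

  public static void main(String[] args) {
    System.out.println(i18n("Open", "state_open"));   // offen
    System.out.println(i18n("Open", "action_open"));  // Öffnen
    System.out.println(i18n("Please enter your name", "error_please_enter_name"));
    // -> "Please enter your name" (no translation, so the English text is used)
  }
}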
Interesting question. I assume the main reason is that you don't have to care about translation or localization files during development as the main language is in the code itself.
Well it probably is just that it's easier to read, and so easier to translate. I'm of the opinion that your way is best for scalability, but it does just require that extra bit of effort, which some developers might not consider worth it... and for some projects, it probably isn't.
There's a fallback hierarchy, from most specific locale to the unlocalised version in the source code.
So French in France might have the following fallback route:
fr_FR
fr
Unlocalised. Source code.
As a result, having proper English sentences in the source code ensures that if a particular translation is not provided in step (1) or (2), you will at least get a proper, understandable sentence rather than random programmer garbage like “error_file_not_found”.
Plus, what do you do if it is a format string: “Sorry but the %s does not exist”? Worse still: “Written %s entries to %s, total size: %d”?
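In Java, for instance, plain resource bundles already implement this kind of fallback, and numbered placeholders in MessageFormat handle the reordering problem with format strings; the bundle name and key below are assumptions, not a prescribed layout:
import java.text.MessageFormat;
import java.util.Locale;
import java.util.ResourceBundle;

public class FallbackDemo {
  public static void main(String[] args) {
    // Lookup walks from messages_fr_FR.properties to messages_fr.properties
    // and finally to the base messages.properties holding the English text
    // (the JVM default locale is also consulted along the way).
    ResourceBundle bundle =
        ResourceBundle.getBundle("messages", Locale.forLanguageTag("fr-FR"));

    // Numbered placeholders let a translator reorder the arguments, unlike
    // positional printf-style "%s ... %s" patterns.
    // messages.properties: written.entries=Written {0} entries to {1}, total size: {2}
    String pattern = bundle.getString("written.entries");
    System.out.println(MessageFormat.format(pattern, 42, "out.log", 1024));
  }
}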
Quite old question but one additional reason I haven't seen in the answers yet:
You could end up with more placeholders than necessary, and thus more work for translators and possibly inconsistent translations. However, good editors like Poedit or Gtranslator can probably help with that.
To stick with your example:
The text "Please enter your name" could appear in a different context in a different template (that the developer is most likely not aware of and shouldn't need to be). E.g. it could be used not as an error but as a prompt like a placeholder of an input field.
If you use
_("Please enter your name");
it would be reusable, the developer can be unaware of the already existing key for an error message and would just use the same text intuitively.
However, if you used
_("error_please_enter_name");
in a previous template, developers wouldn't necessarily be aware of it and would make up a second key (most likely according to a predefined wording scheme to not end up in complete chaos), e.g.
_("prompt_please_enter_name");
which then has to be translated again.
So I think that doesn't scale very well. A pre-agreed wording scheme of suffixes/prefixes (e.g. for contexts) can never be as precise as the text itself, I think: it's either too verbose or too general, you don't know beforehand, and it's difficult to change afterwards. And it's extra work for the developer that isn't worth it, IMHO.
Does anybody agree/disagree?

Steps to develop a multilingual web application

What are the steps to develop a multilingual web application?
Should I store the language texts and resources in a database, or should I use property files or resource files?
I understand that I need to use CurrentCulture with C#, along with CultureFormat etc.
I wanted to know your opinions on the steps to build a multilingual web application.
Doesn't have to be language specific. I'm just looking for steps to build this.
The specific mechanisms are different depending on the platform you are developing on.
As a cursory set of work items:
Separation of code from content. Generally, resources are compiled into assemblies with the help of resource files (in dot net) or stored in property files (in java, though there are other options), or some other location, and referred to by ID. If you want localization costs to be reasonable, you need to avoid changes to the IDs between releases, as most localization tools will treat new IDs as new content.
Identification of areas in the application which make assumptions about the locale of the user, especially date/time, currency, number formatting or input (a sketch of these locale-sensitive pieces, plus collation, follows at the end of this answer).
Create some mechanism for locale-specific CSS content; not all fonts work for all languages, and not all font-sizes are sane for all languages. Don't paint yourself into a corner of forcing Thai text to be displayed in 8 pt. Also, text directionality is going to be right-to-left for at least two languages.
Design your page content to reflow or resize reasonably when more or less content than you expect is present. Many languages expand 50-80% from English for short strings, and 30-40% for longer pieces of content (that's a rough rule of thumb, not a law).
Identify cultural presumptions made by your UI designers, and try to make them more neutral, or, if you've got money and sanity to burn, localizable. Mailboxes don't look the same everywhere, hand gestures aren't universal, and something that's cute or clever or relies on a visual pun won't necessarily travel well.
Choose appropriate encodings for your supported languages. It's now reasonable to use UTF-8 for all content that's sent to web browsers, regardless of language.
Choose appropriate collation for your databases, or enable alternate collations, if you are dealing with content in multiple languages in your databases. Case-insensitivity works differently in many languages than it does in English, and accent insensitivity is acceptable in some languages and generally inappropriate in others.
Don't assume words are delimited by spaces or that sentences are delimited by punctuation, if you're trying to support search.
Avoid:
Storing localized content in databases, unless there's a really, really, good reason. And then, think again. If you have content that is somewhat dynamic and representatives of each region need to customize it, it may be reasonable to store certain categories of content with an associated locale ID.
Trying to be clever with string concatenation. Also, try not to assume rules about pluralization or counting work the same for every culture. Make sure, at least, that the order of strings (and controls) can be specified with format strings that are typical for your platform, or well documented in your localization kit if you elect to roll your own for some reason.
Presuming that it's ok for code bugs to be fixed by localizers. That's generally not reasonable, at least if you want to deliver your product within a reasonable time at a reasonable cost; it's sometimes not even possible.
The first step is to internationalize. The second step is to localize. The third step is to translate.
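As a rough illustration of the locale-sensitive formatting and collation points above (locales, strings, and printed values are arbitrary examples):
import java.text.Collator;
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

public class LocaleSensitivityDemo {
  public static void main(String[] args) {
    Locale de = Locale.GERMANY;
    Locale us = Locale.US;

    // Number and currency formatting differ per locale: separators, grouping,
    // currency symbol and its placement.
    System.out.println(NumberFormat.getNumberInstance(de).format(1234567.89));   // 1.234.567,89
    System.out.println(NumberFormat.getCurrencyInstance(us).format(1234567.89)); // $1,234,567.89

    // Dates: ask the locale for its preferred style instead of hard-coding
    // a pattern such as "MM/dd/yyyy".
    DateTimeFormatter f = DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG).withLocale(de);
    System.out.println(LocalDate.of(2020, 5, 1).format(f));                      // 1. Mai 2020

    // Collation: case and accent sensitivity depend on locale and strength;
    // at PRIMARY strength these two typically compare as equal.
    Collator collator = Collator.getInstance(de);
    collator.setStrength(Collator.PRIMARY);
    System.out.println(collator.compare("Äpfel", "apfel") == 0);
  }
}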

Resources