Next release of Stanza

I'm interested in the Stanza constituency parser for Italian.
According to https://stanfordnlp.github.io/stanza/constituency.html, a new release with updated models (including an Italian model trained on the Turin treebank) was expected in mid-November.
Any idea about when the next release of Stanza will appear?
Thanks
alberto

Technically you can already get it! If you install the dev branch of stanza, you should be able to download an IT parser.
pip install git+https://github.com/stanfordnlp/stanza.git@704d90df2418ee199d83c92c16de180aacccf5c0
Then, in Python:
import stanza
stanza.download("it")
It's trained on the Turin treebank, which has about 4000 trees. If you download the BERT version of the model, it gets over 91 F1 on the Evalita test set (but it has a length limit of about 200 words per sentence).
We might splurge on getting the VIT treebank or something. I've been agitating that we use that budget on Danish or PT or some other language where we have very few users, but it's a hard sell...
Edit: there are also some scripts included for converting the publicly available Turin trees into bracketed format. Their MWT annotation style was to repeat the MWT twice in a row, which doesn't work too well for a task like parsing raw text.
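For reference, once the dev-branch install and download above have succeeded, running the parser should look roughly like this (a minimal sketch based on the Stanza docs; the sample sentence is invented):
import stanza

nlp = stanza.Pipeline("it", processors="tokenize,mwt,pos,constituency")
doc = nlp("Questa è una frase di prova.")
for sentence in doc.sentences:
    print(sentence.constituency)  # bracketed parse tree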

It is still very much a live task ... either December or January, I would say.
p.s. This isn't really a great SO question....

Related

How do I get the asdoc output table to show both the variable labels and value labels in Stata?

I'm trying to make a table using asdoc that will include both the value labels and the variable labels in the output. When I run the following line of code in Stata
asdoc list progname progtype progterm publicprivate cohortsize grereq, label
I get this in the console (no variable labels):
But in the Word doc, it comes out looking like this (variable labels but no value labels in the table cells):
How do I get both the variable and value labels in the table?
The last update of asdoc was on April 10, 2021. I announced the following in that update.
I have now been developing asdoc for almost three years, constantly adding features to it. With the addition of the _docx and xl() classes to Stata, it is high time to add support for native docx and xlsx output to asdoc. Also, given that there is a significant number of LaTeX users, asdoc should be able to create LaTeX documents. It gives me immense pleasure to announce asdocx, which is not only more flexible in making customized tables, but also creates documents in native docx, xlsx, html, and tex formats. If you have enjoyed asdoc and found it useful, please consider buying a copy of asdocx to support its development. Details related to asdocx can be found on this page.
I am still committed to fixing bugs / issues in asdoc. However, it makes more sense to me to add features to asdocx than to asdoc, given that asdocx supports all the latest developments in Word, Excel, and LaTeX.
The requested feature is already available in asdocx. See the following example.
sysuse nlsw88
asdocx list industry age race married grade south in 1/20, replace label

Stanford CoreNLP 4.0.0 no longer splitting verbs and pronouns in Spanish

Very helpfully, Stanford CoreNLP 3.9.2 used to split rolled-together Spanish verbs and pronouns.
This is the 4.0.0 output:
The previous version had more .tagger files. These have not been included with the 4.0.0 distribution.
Is that the cause? Will they be added back?
There are some documentation updates that still need to be made for Stanford CoreNLP 4.0.0.
A major change is that a new multi-word-token annotator has been added that makes tokenization conform to the UD standard. So the new default Spanish pipeline should run tokenize,ssplit,mwt,pos,depparse,ner. It may not be possible to run such a pipeline from the server demo at this time, as some modifications will need to be made; I can try to send you what such modifications would be soon. We will try to make a new release in early summer to handle issues like this that we missed.
It won't split the word in your example unfortunately, but I think in many cases it will do the correct thing. The Spanish mwt model is just based off of a large dictionary of terms, and was tuned to optimize performance on the Spanish training data.
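For what it's worth, here is a minimal sketch of running that pipeline through the CoreNLP client shipped with the stanza Python package; it assumes a local CoreNLP 4.0.0 install with the Spanish models on the classpath, and the sample sentence is invented:
from stanza.server import CoreNLPClient

with CoreNLPClient(
        annotators="tokenize,ssplit,mwt,pos,depparse,ner",
        properties="spanish",  # loads the default Spanish properties
        timeout=30000,
        memory="4G") as client:
    ann = client.annotate("Quiero mostrártelo.")
    for token in ann.sentence[0].token:
        print(token.word, token.pos)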

CoreNLP server shows different NER result from local 3.8 version

I used the sentence
He died in the day before yesterday.
to run CoreNLP NER.
On the server, I got a result like this:
Locally, I used the same sentence and got this result:
He(O) died(O) in(O) the(O) day(TIME) before(O) yesterday(O) .(O)
So how can I get the same result as the server?
In order to increase the likelihood of getting a relevant answer, you may want to rephrase your question and provide a bit more information. And as a bonus, in the process of doing so, you may even find out the answer yourself ;)
For example, what url are you using to get your server result? When I check here: http://nlp.stanford.edu:8080/ner/process , I can select multiple models for English. Not sure which version their API is based on (would say the most recent stable version, but I don't know). Then the title of your post suggests you are using 3.8 locally, but it wouldn't hurt to specify the relevant piece in your pom.xml file, or the models you downloaded yourself.
What model are you using in your code? How are you calling it? (i.e. any other annotators in your pipeline that could be relevant for NER output)
Are you even calling it from code (if so, Java? Python?), or using it from the command line?
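For illustration, here is a hypothetical way to make the local setup explicit enough to compare against the server; the annotator list and the ner.useSUTime flag are examples, not your actual configuration:
from stanza.server import CoreNLPClient

props = {
    "annotators": "tokenize,ssplit,pos,lemma,ner",
    "ner.useSUTime": "true",  # SUTime is what tags spans like "yesterday" as DATE/TIME
}
with CoreNLPClient(properties=props, timeout=30000, memory="4G") as client:
    ann = client.annotate("He died in the day before yesterday.")
    for token in ann.sentence[0].token:
        print(token.word, token.ner)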
A lot of this is summarised in https://stackoverflow.com/help/how-to-ask and it's not that long to read through ;)

Generate PDF from content in Magnolia CMS

We would like to generate a PDF document for a single page. While only this link (and the other discussion linked from there) talks about this subject, the information given is quite slim.
Could anybody share any success stories made so far including source-code?
Has someone succeeded in using wkhtmltopdf?
(we plan to use Magnolia 4.5.6)
After evaluating both Aspose.PDF (a commercial product) and iText, we went with LaTeX. We had some quite specific requirements (e.g. a two-column layout with footnotes, and a very large table) that were not possible with the two products mentioned above.
There are some things to be noted, though: first and foremost, you leave the JVM, and second, LaTeX is itself another macro language to be learned. The quality of the output is very good, however, and we are very happy with this solution.
wkhtmltopdf is used in another project, and the output is also good for more straightforward formatting.
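For anyone trying the wkhtmltopdf route, the integration can be as simple as shelling out to the binary. A minimal sketch (in Python for brevity; the page URL and output name are invented):
import subprocess

page_url = "http://localhost:8080/magnoliaPublic/some-page.html"  # hypothetical published page
subprocess.run(
    ["wkhtmltopdf", "--print-media-type", page_url, "some-page.pdf"],
    check=True,  # raise if wkhtmltopdf exits with an error
)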

What's needed for NLP?

Assuming that I know nothing about anything and that I'm starting programming TODAY, what would you say is necessary for me to learn in order to start working with Natural Language Processing?
I've been struggling with some string-parsing methods, but so far they are just annoying me and making me write ugly code. I'm looking for some fresh ideas on how to create a Remember The Milk-style API to parse the user's input, in order to provide fast data entry based not on form fields but on simple one-line phrases instead.
EDIT: RTM is a to-do list system. In order to enter a task, you don't need to type values into each field (task name, due date, location, etc.). You can simply type a phrase like "Dentist appointment monday at 2PM in WhateverPlace" and it will parse it and fill in all the fields for you.
I don't have any technical constraints, since it's going to be a personal project, but I'm more familiar with the .NET world. Actually, I'm not sure this is a matter of language, but if necessary I'm more than willing to learn a new language to do it.
My project is related to personal finances, so the phrases are more like "Spent 10USD on Coffee last night with my girlfriend", and it would fill in the location, dollar amount, tags, and other fields.
Thanks a lot for any kind of directions that you might give me!
This does not appear to require full NLP. Simple pattern-based information extraction will probably suffice. The basic idea is to tokenize the text, then recognize/classify certain keywords, and finally recognize patterns/phrases.
In your example, tokenizing gives you "Dentist", "appointment", "monday", "at", "2PM", "in", "WhateverPlace". Your tool will recognize that "monday" is a day of the week, "2PM" is a time, etc. Finally, you can find patterns like [at] [TIME] and [in] [Place] and use those to fill in the fields.
A framework like GATE may help, but even that may be a larger hammer than you really need.
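To make the idea concrete, here is a toy sketch of that tokenize/classify/match approach; the day list, time pattern, and field names are all invented for illustration:
import re

DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"}
TIME_RE = re.compile(r"^\d{1,2}(:\d{2})?(am|pm)$", re.IGNORECASE)

def parse_task(phrase):
    tokens = phrase.split()
    fields = {"name_words": []}
    i = 0
    while i < len(tokens):
        lower = tokens[i].lower().rstrip(".,")
        if lower in DAYS:
            fields["due_day"] = lower                 # a day-of-week keyword
        elif TIME_RE.match(lower):
            fields["due_time"] = lower                # the [at] [TIME] pattern
        elif lower == "in" and i + 1 < len(tokens):
            fields["location"] = tokens[i + 1]        # the [in] [PLACE] pattern
            i += 1
        elif lower != "at":                           # "at" only glues [at] [TIME]
            fields["name_words"].append(tokens[i])
        i += 1
    fields["task"] = " ".join(fields.pop("name_words"))
    return fields

print(parse_task("Dentist appointment monday at 2PM in WhateverPlace"))
# {'due_day': 'monday', 'due_time': '2pm', 'location': 'WhateverPlace', 'task': 'Dentist appointment'}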
Have a look at NLTK, it's a good resource for beginner programmers interested in NLP.
http://www.nltk.org/
It is written in Python, which is one of the easier programming languages.
Now that I understand your problem, here is my solution:
You can develop a kind of restricted vocabulary, in which all amounts must end with a $ sign and any time must be in the form 00:00 and/or end with AM/PM. For detecting items, you can use lists of objects from an ontology such as OpenCyc. OpenCyc can provide you with lists of objects such as beer, coffee, bread, milk, etc.; this will help you detect objects in the short phrase. Still, it would be a very fuzzy approach.
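As a toy illustration of that restricted-vocabulary idea applied to the finance phrases (the item list stands in for an ontology lookup, and the patterns are invented):
import re

KNOWN_ITEMS = {"coffee", "beer", "bread", "milk"}  # stand-in for OpenCyc object lists
AMOUNT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(USD|\$)", re.IGNORECASE)

def parse_expense(phrase):
    amount = AMOUNT_RE.search(phrase)
    items = [w for w in re.findall(r"[A-Za-z]+", phrase) if w.lower() in KNOWN_ITEMS]
    return {"amount": amount.group(1) if amount else None, "items": items}

print(parse_expense("Spent 10USD on Coffee last night with my girlfriend"))
# {'amount': '10', 'items': ['Coffee']}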
