CoreNLP server shows different NER result from local 3.8 version - stanford-nlp

I use the sentence
He died in the day before yesterday.
and run CoreNLP NER on it.
On the server, I got a result like this.
Locally, using the same sentence, I got this result:
He(O) died(O) in(O) the(O) day(TIME) before(O) yesterday(O) .(O)
So how can I get the same result as the server?

In order to increase the likelihood of getting a relevant answer, you may want to rephrase your question and provide a bit more information. And as a bonus, in the process of doing so, you may even find out the answer yourself ;)
For example, what URL are you using to get your server result? When I check here: http://nlp.stanford.edu:8080/ner/process , I can select multiple models for English. I'm not sure which version their API is based on (I would guess the most recent stable version, but I don't know). Also, the title of your post suggests you are using 3.8 locally, but it wouldn't hurt to specify the relevant piece of your pom.xml file, or the models you downloaded yourself.
What model are you using in your code? How are you calling it? (i.e. any other annotators in your pipeline that could be relevant for NER output)
Are you even calling it from code (if so, Java? Python?), or using it from the command line?
A lot of this is summarised in https://stackoverflow.com/help/how-to-ask and it's not that long to read through ;)
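For what it's worth, one way to pin down the difference is to query a locally running CoreNLP server with an explicit annotator list and print the NER tag for each token, then compare that against the online demo. The sketch below is only an illustration of that idea: it assumes a server on localhost:9000 started from your 3.8 distribution (java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000) and uses Python's requests, which may not match your actual setup.

import json
import requests

props = {"annotators": "tokenize,ssplit,pos,lemma,ner", "outputFormat": "json"}
text = "He died in the day before yesterday."

# POST the raw text; the server takes the pipeline properties as a JSON query parameter
resp = requests.post("http://localhost:9000",
                     params={"properties": json.dumps(props)},
                     data=text.encode("utf-8"))
doc = resp.json()

# Print each token with its NER label, in the same word(TAG) style as above
for sent in doc["sentences"]:
    print(" ".join("%s(%s)" % (tok["word"], tok["ner"]) for tok in sent["tokens"]))

If the local server and your local code give the same tags, the difference comes down to the models/version behind the online demo rather than anything in your pipeline.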

Related

next release of Stanza

I'm interested in the Stanza constituency parser for Italian.
In https://stanfordnlp.github.io/stanza/constituency.html it is said that a new release with updated models (including an Italian model trained on the Turin treebank) should have been available in mid-November.
Any idea about when the next release of Stanza will appear?
Thanks
alberto
Technically you can already get it! If you install the dev branch of stanza, you should be able to download an IT parser.
pip install git+https://github.com/stanfordnlp/stanza.git@704d90df2418ee199d83c92c16de180aacccf5c0
stanza.download("it")
It's trained on the Turin treebank, which has about 4000 trees. If you download the Bert version of the model, it gets over 91 F1 on the Evalita test set (but has a length limit of about 200 words per sentence).
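Once the dev-branch install and the stanza.download("it") call above have run, loading the Italian constituency parser should look roughly like this. This is just a minimal sketch of the usual Stanza pipeline API; the processor list and the example sentence are my own, not something from the release notes.

import stanza

stanza.download("it")  # fetches the dev models, including the constituency parser
nlp = stanza.Pipeline("it", processors="tokenize,mwt,pos,constituency")
doc = nlp("Il gatto dorme sul divano.")
for sentence in doc.sentences:
    print(sentence.constituency)  # bracketed parse tree for each sentence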
We might splurge on getting the VIT treebank or something. I've been agitating that we use that budget on Danish or PT or some other language where we have very few users, but it's a hard sell...
Edit: there are also some scripts included for converting the publicly available Turin trees into brackets. Their MWT annotation style was to repeat the MWT twice in a row, which doesn't work too well for a task like parsing raw text.
It is still very much a live task ... either December or January, I would say.
p.s. This isn't really a great SO question....

Use Google's libphonenumber with BaseX

I am using BaseX 9.2 to scrape an online phone directory. Nothing illegal, it belongs to a non-profit that my boss is a member of, so I have access to it. What I want is to add all those numbers to my personal phonebook so that I can know who is calling me (mainly to contact my boss). The data is in pretty bad shape, especially the numbers (about a thousand numbers, from all over the world). Some are in E164, some are not, some are downright invalid numbers.
I initially used OpenRefine 3.0 to clean up the data. It also plays very nicely with Google's libphonenumber to whip the numbers into shape. It was as simple as downloading the JAR from Maven, putting it in OpenRefine's lib directory, and then invoking Jython like this on each phone number (numberStr):
from com.google.i18n.phonenumbers import PhoneNumberUtil
from com.google.i18n.phonenumbers.PhoneNumberUtil import PhoneNumberFormat
pu = PhoneNumberUtil.getInstance()
# OpenRefine exposes the current cell as `value`; drop any decimal part
numberStr = str(int(value))
# First parse with a leading '+' and the "unknown" region ZZ to guess the country
number = pu.parse('+' + numberStr, 'ZZ')
try: country = pu.getRegionCodeForNumber(number)
except: country = 'US'
# Re-parse with the detected region, falling back to US if the number is not valid there
number = pu.parse(numberStr, (country if pu.isValidNumberForRegion(number, country) else 'US'))
return pu.format(number, PhoneNumberFormat.E164)
I discovered XPath and BaseX recently and find them very succinct and powerful for HTML. While I could get OpenRefine to directly spit out a VCF, I can't find a way to plug libphonenumber into BaseX. Since both are in Java, I thought it would be straightforward.
I tried their documentation (http://docs.basex.org/wiki/Java_Bindings), but BaseX does not discover the libphonenumber JAR out of the box. I tried various path, renaming, and location combinations. The only way I see is to write a wrapper, make it into an XQuery module (XAR), and import it. That would need significant time and Java coding skills, and I definitely don't have the latter.
Is there a simple way to hook up libphonenumber with BaseX? Or in general, is there a way to link external Java libs with XPath? I could go back to OpenRefine, but it has a very clumsy workflow IMHO. No way to ask the website admin to clean up his act, either. Or, if OpenRefine and BaseX are not the right tools for the job, is there any other way to clean up data, especially phone numbers? I need to do this every few months (for changes and updates on the site) and it gets really tedious if I can't automate it fully.
I would want at least a basic working code sample in an answer. (I work directly off the standalone BaseX JAR on a Windows 10 x64 machine.)
Place libphonenumber-8.10.16.jar in the folder ..basex/lib/custom to get it on the classpath (see http://docs.basex.org/wiki/Startup#Full_Distributions) and run bin/basexgui.bat
declare namespace Pnu="java:com.google.i18n.phonenumbers.PhoneNumberUtil";
declare namespace Pn="java:com.google.i18n.phonenumbers.Phonenumber$PhoneNumber";
let $pnu:=Pnu:getInstance()
let $pn:= Pnu:parse($pnu,"044 668 18 00","CH")
return Pn:getCountryCode($pn)
This returns 41, the country calling code for Switzerland.
There is no standard way to call Java from XPath; however, many Java-based XPath implementations provide custom methods to do this.

How can one create a polyglot PDF?

I like reading the PoC||GTFO issues, and one thing I found remarkable when I first discovered them was the "polyglot" nature of their PDF files.
Let me explain: consider, for example, their 8th issue: you can unzip files from it, you can run it as a script to perform the encryption they talk about, and even better (worse?), their 9th issue can even be played as a music file!
I'm currently writing small scripts every week, along with a little one-page PDF in LaTeX each time to explain said scripts. So I would really enjoy being able to create the same kind of PDF files. Sadly, they explained (partly) in their first issue how to include zip files, but they did so through three small sketches of command lines without actual explanations.
So my question is basically:
How can one create such a polyglot PDF file that contains stuff like a zip and is also a shell script which can be run with arguments just like a normal script?
I'm asking about the process of creation, not just an explanation of how this is possible. The ideal answer for me would be that there are already scripts or programs that make it easy to create such PDF files.
I've tried to search the net for the keywords "polyglot files" and others of the kind and wasn't able to find any useful matches. Maybe this process has another name?
I've already read the presentation by Julia Wolf which explains how things work, but I sadly haven't had time to apply that knowledge to the real world, because I'm not used to playing with file headers and the way a PDF is constructed.
EDIT:
Okay, I've read more and found the 7th edition of PoC||GTFO to be really informative concerning this subject. I may end up being able to create my own scripts to do such polyglot PDF files if I have some more time to consider it.
I played around with polyglots myself after attending Ange's talks and also talking to him in person. You really need to understand the file formats to be able to nest them into each other.
However, long story short, here are some links I found extremely useful for creating polyglots:
Some older Google Code Trunk
PoC of the polyglot stuff
Especially the second link (to GitHub) will help you create polyglots, and also understand how they work and how they are implemented. Since it is mostly Python and written very cleanly, it is very useful and easy to follow.
I feel dissecting some file formats would be a good place to start. You can find many file format specifications for different file types through Google, but they can be a tough read and will likely take you some time to translate into whatever language you are using.
PDF: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
ELF: https://www.cs.cmu.edu/afs/cs/academic/class/15213-s00/doc/elf.pdf
ZIP: http://kat.sdf.org/zip_file_format.txt
The language(s) you select will need a way to read and write raw bytes (not just ASCII alphanumerics), so perhaps C would be good for more direct access to memory. Some Python tricks could help with open-sourcing the scripts easily.
To dissect the files, you may want to build a tool kind of like https://github.com/kvesel/zipbrk/ to take them apart, then put them all back together in a polyglot format. For example, ZIP does not require the section headers to be at the start of the file (or even to be contiguous, for that matter), and the PDF magic number can appear in multiple places within the file as well. I also believe I recall a polyglot tool being included in one of the PoC||GTFO publications (maybe issue 8 or 2??) as a polyglot in the PDF file itself.
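To make that concrete, the simplest PDF/ZIP polyglot relies on exactly those two facts: ZIP readers locate the archive from the end-of-central-directory record at the end of the file, while PDF viewers look for the %PDF header near the beginning, so appending an archive to an existing PDF usually yields a file that opens as both. Here is a minimal Python sketch of that idea (the file names are placeholders, some strict validators will still complain about the trailing data, and making the same file executable as a shell script needs the extra tricks from the PoC||GTFO write-ups):

import shutil
import zipfile

# Start from an ordinary PDF (e.g. one produced by LaTeX)
shutil.copy("paper.pdf", "polyglot.pdf")

# Opening in "a" mode appends a ZIP archive after the existing PDF bytes;
# zipfile records the offsets so ZIP tools can still read the result
with zipfile.ZipFile("polyglot.pdf", "a") as zf:
    zf.write("script.sh")

# Sanity check: the same file still lists its contents as a ZIP archive
with zipfile.ZipFile("polyglot.pdf") as zf:
    print(zf.namelist())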
Don't forget the hackers bible! :)
https://nostarch.com/gtfo

How to store and find records based on location?

I'm thinking of building an application that helps you find local businesses (just an example). You might enter your zip code (or GPS if this is on a phone) and find the closest business within 10 miles, etc. My question is how can I achieve this type of logic? Is there a library or service that I will need to use? Or can this be done with math? I'm not familiar with how this sort of thing usually works, so let me know how I need to store the records so I can query them later. Note: I will be using Ruby and MongoDB.
It should be easy to find the math to solve that, provided you have lat/long coordinates.
Or you could use a full-featured gem that does it for you, like Geocoder, which supports Mongoid and MongoMapper.
Next time you need a feature that is likely a common, already-solved problem, first check whether there is a gem for it at ruby-toolbox; for this case, here are some other gems for geocoding.
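Since you mention MongoDB specifically: it has geospatial indexing built in, so you can store each business as a GeoJSON point and ask for everything within 10 miles directly in the query, without doing the distance math yourself. The sketch below uses pymongo purely for illustration (the database, collection, and coordinates are made up); the same 2dsphere index and $near operator are available from the Ruby driver, Mongoid, or MongoMapper.

from pymongo import MongoClient, GEOSPHERE

businesses = MongoClient().directory.businesses

# GeoJSON points are stored as [longitude, latitude]
businesses.create_index([("location", GEOSPHERE)])
businesses.insert_one({
    "name": "Example Cafe",
    "location": {"type": "Point", "coordinates": [-73.97, 40.77]},
})

# Find businesses within 10 miles (expressed in meters) of the user's position
ten_miles_m = 10 * 1609.34
nearby = businesses.find({
    "location": {"$near": {
        "$geometry": {"type": "Point", "coordinates": [-73.98, 40.76]},
        "$maxDistance": ten_miles_m,
    }}
})
for b in nearby:
    print(b["name"])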
One more solution here...
http://geokit.rubyforge.org/
I think this topic has already been discussed here.

Backpropagation Through Time with Snarli

This question stemmed from the following post, which recommended using Snarli for backpropagation through time. I tried it out for regular backpropagation and it works great. However, I'm not sure about backprop through time. With the limited documentation, I can't quite tell how to do it. I used BpptUpdate, but I need to set some momentum term for a layer, and I'm a little confused by this (which layer to set it on, and how).
Anyway, I'm just looking for a quick response, and I understand that the audience who has used Snarli is probably very limited. My next step is to email the author if I don't hear anything, and then I can post the answer here.
So, maybe this goes without saying, but after emailing the author I found that the examples are in the CVS repository (not in the .jar file) or in the snarli-apps compressed files at http://sourceforge.net/projects/snarli/files/snarli/Beta0.21/.
An example for BPTT is found in the Caudill file, the Elman loop is found in elman, etc.
