Google Natural Language Python library has problems predicting when certain words are in a sentence - google-cloud-automl

On the portal, I can insert a sentence and get a score back. If I use the Python library, I get sentences with no scores. On further investigation, it turns out a single word (without punctuation) prevents the prediction. If I replace this word with another, it works; if I replace it with two words, it works; but if I replace it with "United States", which is also different from the original word, I again get no sentiment score. None of this is an issue on the portal, so either it's the Python library or the portal is using a different prediction engine.
Has anyone run into this before, and is there a solution? I am going to have to look at their REST interface now, as I have lost confidence in the Python library.

The C# library works fine - way to go, Google, for a shoddy Python library.

Related

Using Google's API to split string into words?

I'm trying to figure out which API I should use to get Google to intelligently split a string into words.
Input:
thequickbrownfoxjumpsoverthelazydog
Output:
the quick brown fox jumps over the lazy dog
When I go to Google Translate and input the string (with auto-detect language) and click on the "Listen" icon for Google to read out the string, it breaks up the words and reads it out correctly. So, I know they're able to do it.
But what I can't figure out is if it's the API for Google Translate or their Text-To-Speech API that's breaking up the words. Or if there's any way to get those broken up words in an API response somewhere.
Does anyone have experience using Google's APIs to do this?
AFAIK, there isn't an API in Google Cloud that does that specifically, although it does look like the Translation API parses the concatenated words in the background when translating.
Since you can't translate with the source language equal to the target language, what you could do is translate to any other language and then translate back to the original. That seems a bit overkill, though.
You could create a Feature Request to ask for such a feature to be implemented in the NLP API for example.
But, depending on your use case, I suppose you could also use the method suggested in this other Stack Overflow answer, which uses dynamic programming to infer the location of spaces in a string without spaces.
Another user even made a pip package named wordninja (see the second answer on the same post) based on that approach.
Run pip3 install wordninja to install it.
Example usage:
$ python
>>> import wordninja
>>> wordninja.split('thequickbrownfoxjumpsoverthelazydog')
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
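The dynamic-programming idea behind that answer (and behind wordninja) can be sketched as follows. The tiny word list and the Zipf-style cost function are illustrative stand-ins for a real frequency-ranked dictionary:

```python
import math

# Illustrative mini-dictionary; a real implementation would load a large
# frequency-ranked word list (as wordninja does).
WORDS = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
# Zipf-style cost: words earlier in the frequency list are cheaper.
COST = {w: math.log((i + 1) * math.log(len(WORDS) + 1)) for i, w in enumerate(WORDS)}

def split_words(s):
    # best[i] = (total cost, words) for the best split of s[:i]
    best = [(0.0, [])]
    for i in range(1, len(s) + 1):
        candidates = []
        for j in range(max(0, i - 12), i):  # cap candidate word length at 12
            word = s[j:i]
            if word in COST:
                prev_cost, prev_words = best[j]
                candidates.append((prev_cost + COST[word], prev_words + [word]))
        # Fallback: treat a single character as an (expensive) unknown word
        prev_cost, prev_words = best[i - 1]
        candidates.append((prev_cost + 10.0, prev_words + [s[i - 1]]))
        best.append(min(candidates, key=lambda c: c[0]))
    return best[-1][1]

print(split_words("thequickbrownfoxjumpsoverthelazydog"))
```

Because every dictionary word costs far less than the per-character fallback, any split that covers the string with known words beats one that leaves stray characters.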

Use Google's libphonenumber with BaseX

I am using BaseX 9.2 to scrape an online phone directory. Nothing illegal - it belongs to a non-profit that my boss is a member of, so I have access to it. What I want is to add all those numbers to my personal phonebook so that I know who is calling me (mainly to contact my boss). The data is in pretty bad shape, especially the numbers (about a thousand of them, from all over the world). Some are in E164, some are not, and some are downright invalid.
I initially used OpenRefine 3.0 to clean up the data. It also plays very nicely with Google's libphonenumber to whip the numbers into shape. It was as simple as downloading the JAR from Maven, putting it in OpenRefine's lib directory, and then invoking Jython like this on each phone number (numberStr):
# Jython snippet run by OpenRefine on each cell (value holds the raw number)
from com.google.i18n.phonenumbers import PhoneNumberUtil
from com.google.i18n.phonenumbers.PhoneNumberUtil import PhoneNumberFormat
pu = PhoneNumberUtil.getInstance()
numberStr = str(int(value))
# First parse assuming an international number ('ZZ' = unknown region)
number = pu.parse('+' + numberStr, 'ZZ')
try: country = pu.getRegionCodeForNumber(number)
except: country = 'US'  # fall back to a default region
# Re-parse with the detected region, or US if the number is invalid for it
number = pu.parse(numberStr, (country if pu.isValidNumberForRegion(number, country) else 'US'))
return pu.format(number, PhoneNumberFormat.E164)
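For a quick sanity check outside OpenRefine/Jython, here's a library-free Python sketch of just the rough normalization step. The helper name and heuristics are invented for illustration; it does none of libphonenumber's validation and is no substitute for it:

```python
import re

def to_e164_guess(raw, default_cc="1"):
    """Very rough E.164 normalization: strips punctuation and guesses a
    country code. Unlike libphonenumber, this does NO validation; it only
    illustrates the kind of normalization PhoneNumberUtil performs."""
    digits = re.sub(r"[^\d+]", "", raw.strip())
    if digits.startswith("00"):          # international dialing prefix
        digits = "+" + digits[2:]
    if not digits.startswith("+"):
        # Assume a default country code; drop trunk '0's (rough heuristic)
        digits = "+" + default_cc + digits.lstrip("0")
    return digits

print(to_e164_guess("(044) 668-18-00", default_cc="41"))  # '+41446681800'
```

Real numbers need libphonenumber's per-country metadata (trunk prefixes and length rules vary wildly), which is exactly why the Jython snippet above delegates to PhoneNumberUtil.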
I discovered XPath and BaseX recently and find them very succinct and powerful with HTML. While I could get OpenRefine to spit out a VCF directly, I can't find a way to plug libphonenumber into BaseX. Since both are in Java, I thought it would be straightforward.
I tried their documentation (http://docs.basex.org/wiki/Java_Bindings), but BaseX does not discover the libphonenumber JAR out-of-the-box. I tried various path, renaming, and location combinations. The only way I can see is to write a wrapper and make it into an XQuery module (XAR) and import it. That would need significant time and Java coding skills, and I definitely don't have the latter.
Is there a simple way to hook up libphonenumber with BaseX? Or, in general, is there a way to link external Java libs with XPath? I could go back to OpenRefine, but it has a very clumsy workflow IMHO. There's no way to ask the website admin to clean up his act, either. Or, if OpenRefine and BaseX are not the right tools for the job, is there any other way to clean up data, especially phone numbers? I need to do this every few months (for changes and updates on the site) and it's getting really tedious without full automation.
I would want at least a basic working code sample for an answer .. (I work directly off the standalone BaseX JAR on a Windows 10 x64 machine)
Place libphonenumber-8.10.16.jar in the folder ..basex/lib/custom to get it on the classpath (see http://docs.basex.org/wiki/Startup#Full_Distributions) and run bin/basexgui.bat
declare namespace Pnu="java:com.google.i18n.phonenumbers.PhoneNumberUtil";
declare namespace Pn="java:com.google.i18n.phonenumbers.Phonenumber$PhoneNumber";
let $pnu:=Pnu:getInstance()
let $pn:= Pnu:parse($pnu,"044 668 18 00","CH")
return Pn:getCountryCode($pn)
Returns the string "41"
There is no standard way to call Java from XPath, however many Java based XPath implementations provide custom methods to do this.

How accurate is Google's libphonenumber?

I'm wanting to incorporate Google's libphonenumber library into a CRM solution that I'm working on, to identify things such as:
Whether a phone number is mobile or landline
Geo-location of the number
I've done some searching online, and can't seem to find anything discussing what algorithms the library is using to determine this information, and how reliable those methods are.
Is there any such documentation (ie, details of the these algorithms and their respective reliability)? Or really, anything to help me understand what happens under-the-covers for this library?
It's an open-source library, so you can see exactly how it works :)
svn checkout http://code.google.com/p/libphonenumber/source/checkout
(Note: the project has since moved from Google Code to GitHub, at github.com/google/libphonenumber.)
I've had a quick look at the source, and it seems to work by testing the phone number against a series of regular expressions. Large per-country metadata files define the regexes that tell you the type of a phone number (for example, in the UK all mobiles start with "07", so there is a regex based on that).
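As a toy illustration of that regex-metadata approach (the patterns below are made up for the example and are nowhere near libphonenumber's real, far more detailed per-country metadata):

```python
import re

# Illustrative patterns only -- real libphonenumber metadata is much richer.
UK_PATTERNS = [
    ("MOBILE",   re.compile(r"^07\d{9}$")),       # e.g. 07911123456
    ("LANDLINE", re.compile(r"^0[12]\d{8,9}$")),  # e.g. 02079460000
]

def classify_uk(number):
    """Classify a UK number by matching its digits against ordered patterns."""
    digits = re.sub(r"\D", "", number)
    for kind, pattern in UK_PATTERNS:
        if pattern.match(digits):
            return kind
    return "UNKNOWN"

print(classify_uk("07911 123456"))   # MOBILE
print(classify_uk("020 7946 0000"))  # LANDLINE
```

Reliability therefore comes down to how current the metadata is; libphonenumber ships regularly updated numbering-plan data per country, which is what these toy patterns stand in for.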

Bing/Google/Flickr API: how would you find an image to go along each of 150,000 Japanese sentences?

I'm doing a part-of-speech & morphological analysis project for Japanese sentences. Each sentence will have its own webpage. To make the page more visual, I want to show one picture that is somehow related to the sentence. For example, for the sentence "私は学生です" ("I'm a student"), relevant pictures would be of a school, a Japanese textbook, students, etc. What I have: part-of-speech tagging for every word. My approach so far: use 2-3 nouns from every sentence and retrieve the first image from the search results using the Bing Images API. Note: all the sentence processing up to this point was done in Java.
Have a couple of questions though:
1) what is better (richer corpus & powerful search), Google Images API, Bing Images API, Flickr API, etc. for searching nouns in Japanese?
2) how do you select the most important noun from the sentence to do the query in Image Search Engine without doing complicated topic modeling, etc.?
Thanks!
Japanese WordNet has links to OpenClipart pictures. That could be another relevant source. They describe it in their paper called "Enhancing the Japanese WordNet".
I thought you would start by choosing any noun before は、が and を and giving these priority - probably in that order.
But that assumes that your part-of-speech tagging is good enough to get は=subject identified properly (as I guess you know that は is not always the subject marker).
I looked at a bunch of sample sentences here with this technique in mind and found it worked about as well as could be expected - except where none of those particles is used, which is rare-ish.
And then there are sentences like the one below, where you'd have to consider looking for で and the noun before it when there is no を or は - because, if you notice, the word 人 (people) here really doesn't tell you anything about what's being said. Without parsing the context properly, you don't even know whether the noun means one person or people.
毎年 交通事故で 多くの人が 死にます
(many people die in traffic accidents every year)
But basically, couldn't you implement a priority/fallback type system like this?
BTW I hope your sentences all use kanji, or when you see はし (in one of the sentences linked to) you won't know whether to show a bridge or chopsticks - and showing the wrong one will probably not be good.
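That priority/fallback scheme could look something like the sketch below. The token format (surface form plus a POS tag) is a made-up stand-in for whatever your tagger actually outputs:

```python
# Priority order from the answer: noun before は, then が, then を, then で.
PARTICLE_PRIORITY = ["は", "が", "を", "で"]

def pick_query_noun(tokens):
    """tokens: list of (surface, pos) pairs in sentence order.
    Returns the noun immediately preceding the highest-priority particle."""
    for particle in PARTICLE_PRIORITY:
        for i in range(1, len(tokens)):
            surface, _pos = tokens[i]
            prev_surface, prev_pos = tokens[i - 1]
            if surface == particle and prev_pos == "NOUN":
                return prev_surface
    return None  # no particle-marked noun found

# 毎年 交通事故で 多くの人が 死にます
tokens = [("毎年", "ADV"), ("交通事故", "NOUN"), ("で", "PRT"),
          ("多く", "NOUN"), ("の", "PRT"), ("人", "NOUN"), ("が", "PRT"),
          ("死にます", "VERB")]
print(pick_query_noun(tokens))
```

On this very sentence the scheme picks 人 (since が outranks で), which illustrates the limitation discussed above: the more informative noun here is 交通事故, before で.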

What's needed for NLP?

Assuming that I know nothing about anything and that I'm starting to program TODAY, what would you say is necessary for me to learn in order to start working with Natural Language Processing?
I've been struggling with some string-parsing methods, but so far they're just annoying me and making me write ugly code. I'm looking for some fresh ideas on how to create a Remember The Milk-style parser for user input, to provide a form for fast data entry based not on fields but on simple one-line phrases.
EDIT: RTM is a todo-list system. To enter a task, you don't have to fill in each field (task name, due date, location, etc.). You can simply type a phrase like "Dentist appointment monday at 2PM in WhateverPlace" and it will parse it and fill in all the fields for you.
I don't have any technical constraints, since it's going to be a personal project, but I'm more familiar with the .NET world. Actually, I'm not sure this is a matter of language, but if necessary I'm more than willing to learn a new one.
My project is related to personal finances, so the phrases are more like "Spent 10USD on Coffee last night with my girlfriend", and it would fill in the location, amount of $$$, tags and other fields.
Thanks a lot for any kind of directions that you might give me!
This does not appear to require full NLP. Simple pattern-based information extraction will probably suffice. The basic idea is to tokenize the text, then recognize/classify certain keywords, and finally recognize patterns/phrases.
In your example, tokenizing gives you "Dentist", "appointment", "monday", "at", "2PM", "in", "WhateverPlace". Your tool will recognize that "monday" is a day of the week, "2PM" is a time, etc. Finally, you can find patterns like [at] [TIME] and [in] [Place] and use those to fill in the fields.
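A minimal sketch of that tokenize-classify-match pipeline, using the example sentence above (the field names and patterns are invented for the illustration):

```python
import re

DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"}

def parse_task(text):
    tokens = text.split()                            # 1. tokenize
    fields = {"name": [], "day": None, "time": None, "place": None}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.lower() in DAYS:                      # 2. keyword class: day
            fields["day"] = tok
        elif re.fullmatch(r"\d{1,2}(:\d{2})?(AM|PM)", tok, re.IGNORECASE):
            fields["time"] = tok                     # 2. keyword class: time
        elif tok.lower() == "at":
            pass                                     # 3. pattern: [at] [TIME]
        elif tok.lower() == "in" and i + 1 < len(tokens):
            fields["place"] = tokens[i + 1]          # 3. pattern: [in] [PLACE]
            i += 1
        else:
            fields["name"].append(tok)               # everything else: task name
        i += 1
    fields["name"] = " ".join(fields["name"])
    return fields

print(parse_task("Dentist appointment monday at 2PM in WhateverPlace"))
```

Even this crude version fills all four fields for the example sentence; a real tool would need fuzzier matching, but the tokenize/classify/match structure stays the same.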
A framework like GATE may help, but even that may be a larger hammer than you really need.
Have a look at NLTK; it's a good resource for beginner programmers interested in NLP.
http://www.nltk.org/
It is written in Python, which is one of the easier programming languages to learn.
Now that I understand your problem, here is my solution:
You can develop a kind of restricted vocabulary, in which all amounts must end with a $ sign and all times must be in the form 00:00 and/or end with AM/PM. For detecting items, you can use a list of objects from an ontology such as OpenCyc, which can provide you with lists of objects such as beer, coffee, bread, milk, etc.; this will help you detect objects in the short phrase. Still, it would be a very fuzzy approach.
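Under such a restricted vocabulary, the personal-finance example could be handled roughly like this. The amount format and the tiny item set stand in for the ontology lookup; both are assumptions for the sketch:

```python
import re

# Tiny stand-in for an ontology object list such as OpenCyc's.
KNOWN_ITEMS = {"coffee", "beer", "bread", "milk"}

def parse_expense(phrase):
    entry = {"amount": None, "currency": None, "item": None}
    # Restricted amount format: digits immediately followed by a currency code.
    m = re.search(r"\b(\d+(?:\.\d{2})?)(USD|EUR|GBP)\b", phrase)
    if m:
        entry["amount"] = float(m.group(1))
        entry["currency"] = m.group(2)
    # Item detection: any word found in the ontology-derived list.
    for word in re.findall(r"[A-Za-z]+", phrase):
        if word.lower() in KNOWN_ITEMS:
            entry["item"] = word.lower()
    return entry

print(parse_expense("Spent 10USD on Coffee last night with my girlfriend"))
```

The fuzziness the answer warns about shows up immediately: "last night" needs a date heuristic, and any item missing from the ontology list is silently dropped.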
