Use Google's libphonenumber with BaseX - xpath

I am using BaseX 9.2 to scrape an online phone directory. Nothing illegal: it belongs to a non-profit that my boss is a member of, so I have access to it. What I want is to add all those numbers to my personal phonebook so that I know who is calling me (mainly people trying to reach my boss). The data is in pretty bad shape, especially the numbers (about a thousand of them, from all over the world). Some are in E164 format, some are not, and some are downright invalid.
I initially used OpenRefine 3.0 to clean up the data. It also plays very nicely with Google's libphonenumber for whipping the numbers into shape. It was as simple as downloading the JAR from Maven, putting it in OpenRefine's lib directory, and then invoking Jython like this on each phone number (value holds the cell contents):
from com.google.i18n.phonenumbers import PhoneNumberUtil
from com.google.i18n.phonenumbers.PhoneNumberUtil import PhoneNumberFormat

pu = PhoneNumberUtil.getInstance()
numberStr = str(int(value))                # drop spaces and punctuation
number = pu.parse('+' + numberStr, 'ZZ')   # first try it as an international number
try:
    country = pu.getRegionCodeForNumber(number)
except:
    country = 'US'
# re-parse against the detected region, falling back to US
number = pu.parse(numberStr, country if pu.isValidNumberForRegion(number, country) else 'US')
return pu.format(number, PhoneNumberFormat.E164)
I discovered XPath and BaseX recently and find them very succinct and powerful for working with HTML. While I could get OpenRefine to spit out a VCF directly, I can't find a way to plug libphonenumber into BaseX. Since both are in Java, I thought it would be straightforward.
I tried their documentation (http://docs.basex.org/wiki/Java_Bindings), but BaseX does not discover the libphonenumber JAR out of the box. I tried various combinations of paths, names and locations. The only way I can see is to write a wrapper, package it as an XQuery module (XAR) and import that. This would need significant time and Java coding skills, and I definitely don't have the latter.
Is there a simple way to hook up libphonenumber with BaseX? Or, in general, is there a way to link external Java libraries with XPath? I could go back to OpenRefine, but it has a very clumsy workflow, IMHO. There is no way to ask the website admin to clean up his act, either. Or, if OpenRefine and BaseX are not the right tools for the job, is there any other way to clean up data, especially phone numbers? I need to do this every few months (to pick up changes and updates on the site) and it is getting really tedious without full automation.
I would want at least a basic working code sample in an answer. (I work directly off the standalone BaseX JAR on a Windows 10 x64 machine.)

Place libphonenumber-8.10.16.jar in the folder ..basex/lib/custom to get it on the classpath (see http://docs.basex.org/wiki/Startup#Full_Distributions), then run bin/basexgui.bat:
declare namespace Pnu = "java:com.google.i18n.phonenumbers.PhoneNumberUtil";
declare namespace Pn = "java:com.google.i18n.phonenumbers.Phonenumber$PhoneNumber";
let $pnu := Pnu:getInstance()
let $pn := Pnu:parse($pnu, "044 668 18 00", "CH")
return Pn:getCountryCode($pn)
Returns 41, the country calling code for Switzerland.
There is no standard way to call Java from XPath, but many Java-based XPath implementations provide custom mechanisms for doing so.

Related

Generate EDGAR FTP File Path List

I'm brand new to programming (though I'm willing to learn), so apologies in advance for my very basic question.
The SEC makes all of their filings available via FTP, and eventually I would like to download a subset of these files in bulk. Before writing such a script, however, I need to generate a list of the locations of these files, which follow this format:
/edgar/data/51143/000005114313000007/0000051143-13-000007-index.htm
51143 = the company ID, and I already accessed the list of company IDs I need via FTP
000005114313000007/0000051143-13-000007 = the report ID, aka "accession number"
I'm struggling with how to figure this out, as the documentation is fairly light. If I already have the accession number, e.g. 000005114313000007/0000051143-13-000007, then it's pretty straightforward. But I'm looking at ~45k entries and would obviously need to generate these automatically for a given CIK ID (which I already have).
Is there an automated way to achieve this?
Welcome to SO.
I'm currently scraping the same site, so I'll explain what I've done so far. I'm assuming you have the CIK numbers of the companies you're looking to scrape. If you search for a company's CIK, you'll get a list of all of the files that are available for that company. Let's use Apple as an example (since they have a TON of files):
Link to Apple's Filings
From here you can set a search filter. The document you linked was a 10-Q, so let's use that. If you filter on 10-Q, you'll get a list of all of the 10-Q documents, and you'll notice that the URL changes slightly to accommodate the filter.
You can use Python and its web scraping libraries to take that URL and scrape all of the URLs of the documents in the table on that page. For each of these links you can scrape whatever links or information you want off the page. I personally use BeautifulSoup4, but lxml is another choice for web scraping, should you choose Python as your programming language. I would recommend using Python, as it's fairly easy to learn the basics and some intermediate programming constructs.
Past that, the project is yours. Good luck, I've posted some links below to get you started. I'm only allowed to post two links since I'm new to the site, so I'll give you the beautiful soup link:
Beautiful Soup Home Page
If you choose to use Python and are new to the language, check out the codecademy python course, and don't forget to check out lxml, as some people prefer it over BeautifulSoup (some people also use both in conjunction, so it's all a matter of personal preference).
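For what it's worth, the path construction from the original question is mechanical once you have a CIK and an accession number: the folder name is just the accession number with the dashes stripped. A minimal sketch (the function name is mine):

```python
def index_path(cik, accession):
    # accession looks like "0000051143-13-000007"; the folder segment is the
    # same number with the dashes removed.
    folder = accession.replace("-", "")
    return "/edgar/data/{}/{}/{}-index.htm".format(int(cik), folder, accession)
```

Calling index_path("51143", "0000051143-13-000007") reproduces the path from the question; loop it over your ~45k accession numbers to build the full list.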

How accurate is Google's libphonenumber?

I want to incorporate Google's libphonenumber library into a CRM solution that I'm working on, to identify things such as:
Whether a phone number is mobile or landline
Geo-location of the number
I've done some searching online, and can't seem to find anything discussing what algorithms the library is using to determine this information, and how reliable those methods are.
Is there any such documentation (i.e., details of these algorithms and their respective reliability)? Or really, anything to help me understand what happens under the covers in this library?
It's an Open Source library, so you can see exactly how it works :)
svn checkout http://code.google.com/p/libphonenumber/source/checkout
I've had a quick look at the source, and it seems to work by testing the phone number with a series of regular expressions. Big regex files are defined for various countries, which define the regular expressions that will tell you the type of phone number (for example, in the UK, all mobiles start with "07", so there will be a regex based on that).
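To make that concrete, here is a deliberately toy Python sketch of the idea. The real library ships far more detailed, generated patterns per country, so treat these two regexes as illustration only:

```python
import re

# Toy stand-ins for per-country "number type" patterns. libphonenumber's
# actual metadata is generated and much more precise than this.
UK_NATIONAL_PATTERNS = {
    "mobile": re.compile(r"7[1-9]\d{8}"),          # e.g. 07911 123456, trunk 0 dropped
    "fixed_line": re.compile(r"(1\d{8,9}|2\d{9})"),
}

def uk_number_type(national_number):
    """Classify a UK national-significant number (no +44, no leading 0)."""
    for kind, pattern in UK_NATIONAL_PATTERNS.items():
        if pattern.fullmatch(national_number):
            return kind
    return "unknown"
```

This also suggests why accuracy varies: the answers the library gives are only as good as how current each country's numbering-plan metadata is.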

How to store and find records based on location?

I'm thinking of building an application that helps you find local businesses (just an example). You might enter your zip code (or GPS if this is on a phone) and find the closest business within 10 miles, etc. My question is how can I achieve this type of logic? Is there a library or service that I will need to use? Or can this be done with math? I'm not familiar with how this sort of thing usually works, so let me know how I need to store the records so I can query them later. Note: I will be using Ruby and MongoDB.
It should be easy to find the math to solve that, given lat/long coordinates.
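For reference, the underlying math is the haversine formula; a sketch in Python (3956 is an approximation of the Earth's radius in miles):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    # Haversine formula: great-circle distance between two points, in miles.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3956 * asin(sqrt(a))
```

For "businesses within 10 miles" you would still want a geospatial index (MongoDB has one built in) rather than computing this against every record.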
Or you could use a full-featured gem to do that for you, like Geocoder, which supports Mongoid and MongoMapper.
Next time you need a feature that might be a common, well-known problem, first check whether there is a gem for it at ruby-toolbox; for this case, here are some other gems for geocoding.
One more solution here...
http://geokit.rubyforge.org/
I think this topic has already been discussed here.

Visual Studio: Getting standard font... files?

I've run into another problem while fixing up my game to use this library.
SDL.NET, the library I'm using for graphics and input in my VB.NET app, has its own special Font class which is entirely separate from System.Drawing.Font. Here are its two constructors:
public Font(string fileName, int pointSize)
public Font(byte[] array, int pointSize)
Both need a file, in glaring contrast to System.Drawing.Font (which just needs the font family name). I'm not sure where these files are. My first instinct was to look in Windows\Fonts for the ones I want to use, but... you can guess how that approach failed.
I need to find the files for Cambria, DotumChe, and Photo (all of which came installed on the computer). My program is very far from completion, so I'm not worrying about the legal complications of what I'm trying to do. I just want to find the files and get them in my project so I can move on. Is there a place on my computer where I can find them?
A C# example is available at Steve's Tech Talk. It uses P/Invoke to get the fonts folder path.
I'm assuming you plan on resolving legal/licensing issues before distributing the game.
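If you just need the folder path, the essence of that P/Invoke example can be approximated from the environment. A rough sketch (the real example resolves CSIDL_FONTS through the shell API, which is more robust than an environment variable):

```python
import os

def windows_fonts_dir():
    # Approximation of the CSIDL_FONTS folder: %WINDIR%\Fonts. The C# example
    # resolves this properly via P/Invoke; the env-var fallback covers
    # typical installations.
    return os.path.join(os.environ.get("WINDIR", r"C:\Windows"), "Fonts")
```

The fonts live there as .ttf/.ttc files, though the file names don't always match the family names shown in the Fonts control panel, so some manual matching may be needed.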

How do you manage the String Translation Process?

I am working on a software project that needs to be translated into 30 languages. This means that changing any string incurs a relatively high cost. Additionally, translation does not happen overnight: the translation package needs to be worked on by different translators, so it might take a while.
Adding new features is somewhat cumbersome. We can think up all the strings that will be needed before we actually code the UI, but sometimes we still need to add new strings because of bug fixes or oversights.
So the question is, how do you manage all this process? Any tips in how to ease the impact of translation in the software project? How to rule the strings, instead of having the strings rule you?
EDIT: We are using Java and all strings are internationalized using resource bundles, so the problem is not the internationalization per se, but the management of the strings.
I'm not sure which platform you're internationalizing in. I've written an answer before on the best way to i18n an application; see What do I need to know to globalize an asp.net application?
That said, managing the translations themselves is hard. The problem is that you'll be using the same piece of text across multiple pages. Your framework, however, may not support keeping that piece of text in a single file (resource files in asp.net, for instance, encourage you to have one resource file per language).
The way that we found to work with things was to have a central database repository of translations. We created a small .net application to import translations from resource files into that database and to export translations from that database to resource files. There is, thus, an additional step in the build process to build the resource files.
The other issue you're going to have is passing translations to your translation vendor and back. There are a couple ways for this - see if your translation vendor is willing to accept XML files and return properly formatted XML files. This is, really, one of the best ways, since it allows you to automate your import and export of translation files. Another alternative, if your vendor allows it, is to create a website to allow them to edit the translations.
In the end, your answer for translations will be the same for any other process that requires repetition and manual work. Automate, automate, automate. Automate every single thing that you can. Copy and paste is not your friend in this scenario.
Pootle is a web app that lets you manage the translation process over the web.
There are a number of major issues that need to be considered when internationalizing an application.
Not all strings are created equal. Depending upon the language, the length of a sentence can change significantly. In some languages, it can be half as long, and in others it can be triple the length. Make sure to design your GUI widgets with enough space to handle strings that are longer than your English strings.
Translators are typically not programmers. Do not expect translators to be able to read and maintain the correct file formats for resource files. You should set up a mechanism to round-trip the translated data between your resource files and something like a spreadsheet. One possibility is to use XSL filters with Open Office, so that you can save to resource files directly from a spreadsheet application. Also, translators or translation service companies may already have their own databases, so it is good to ask what they use and write some tools to automate the exchange.
You will need to append data to strings - don't pretend that you will never have to or you will always be able to put the string at the end. Make sure that you have a string formatter setup for replacing placeholders in strings. Furthermore, make sure to document what are typical values that will be replaced for the translators. Remember, the order of the placeholders may change in different languages.
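As an illustration of the reordering point, named placeholders (java.text.MessageFormat's numbered arguments serve the same purpose in Java) let each translation choose its own argument order; the German string below is only an example:

```python
# Named placeholders let each translation put the arguments in its own order.
TEMPLATES = {
    "en": "{user} uploaded {count} files",
    "de": "{count} Dateien wurden von {user} hochgeladen",  # arguments reordered
}

def render(lang, **values):
    return TEMPLATES[lang].format(**values)
```

Concatenating fragments ("uploaded" + count + "files") would make this reordering impossible, which is exactly why the placeholder discipline matters.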
Name your i18n string variables something that reflects their meaning. Do you really want to be looking up numbers in a resource file to find out the contents of a given string? Developers depend on being able to read the string output in code more than they often realize.
Don't be afraid of code generation. In my current project, I have written a small Java program, called by Ant, that parses all of the keys of the default-language (master) resource file and then maps each key to a constant defined in my localization class. See below; the lines between the //---- comments are auto-generated. I run the generator every time I add a string.
public final class l7d {
    // ...normal junk

    /**
     * Reference to the localized strings resource bundle.
     */
    public static final ResourceBundle l7dBundle =
        ResourceBundle.getBundle(BUNDLE_PATH);

    //---- start l7d fields ----\
    public static final String ERROR_AuthenticationException;
    public static final String ERROR_cannot_find_algorithm;
    public static final String ERROR_invalid_context;
    // ...many more
    //---- end l7d fields ----\

    static {
        //---- start setting l7d fields ----\
        ERROR_AuthenticationException = l7dBundle.getString("ERROR_AuthenticationException");
        ERROR_cannot_find_algorithm = l7dBundle.getString("ERROR_cannot_find_algorithm");
        ERROR_invalid_context = l7dBundle.getString("ERROR_invalid_context");
        // ...many more
        //---- end setting l7d fields ----\
    }
}
The approach above offers a few benefits.
Since your string key is now defined as a field, your IDE should support code completion for it. This will save you a lot of typing. It gets really frustrating looking up every key name and fixing typos every time you want to print a string.
Someone please correct me if I am wrong, but loading all of the strings into memory at static initialization (as in the example) results in a quicker load time at the cost of additional memory usage. I have found the additional memory used to be negligible and worth the trade-off.
The localised projects I've worked on had 'string freeze' dates. After this time, the only way strings were allowed to be changed was with permission from a very senior member of the project management team.
It isn't exactly a perfect solution, but it did enable us to put defects regarding strings on hold until the next release, with a valid reason. Once the string freeze has occurred, you also have a valid reason to deny adding brand-new features to the project on spur-of-the-moment decisions. And having the permission come from high up meant that middle managers had no power to change specs on you :)
If available, use a database for this. Each string gets an id, and there is either a table for each language, or one table for all with the language in a column (depending on how the site is accessed the performance dictates which is better). This allows updates from translators without trying to manage code files and version control details. Further, it's almost trivial to run reports on what isn't translated, and keep track of what was an autotranslation (engine) vs a real human translation.
If no database, then I stick each language in a separate file so version control issues are reduced. But the structure is basically the same - each string has an id.
-Adam
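The "one table for all languages with the language in a column" variant, plus the untranslated-strings report mentioned above, might look like this SQLite sketch (table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE translations (
    string_id   TEXT,
    lang        TEXT,
    text        TEXT,
    is_machine  INTEGER DEFAULT 0,   -- autotranslation (engine) vs. human
    PRIMARY KEY (string_id, lang))""")
conn.executemany("INSERT INTO translations VALUES (?, ?, ?, ?)", [
    ("greeting", "en", "Hello",   0),
    ("greeting", "de", "Hallo",   0),
    ("farewell", "en", "Goodbye", 0),
])

# Report: which strings still lack a German translation?
missing_de = [row[0] for row in conn.execute("""
    SELECT string_id FROM translations t
    WHERE lang = 'en'
      AND NOT EXISTS (SELECT 1 FROM translations
                      WHERE string_id = t.string_id AND lang = 'de')""")]
```

With the data above, missing_de contains only "farewell", which is exactly the kind of report that is painful to produce from scattered code files.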
Not only did we use a database instead of the vaunted resource files (I have never understood why people use something that is such a pain to manage when we have such good tools for dealing with databases), but we also avoided the need to tag things in the application (forgetting to tag controls with numbers in VB6 forms was always a problem) by using reflection to identify the controls for translation. We then use an XML file that maps the controls to the phrase IDs in the dictionary database.
Although the mapping file had to be managed, it could still be managed independent of the build process, and the translation of the application was actually possible by end-users who had rights in the database.
The solution we have come up with so far is a small application in Excel that reads all the property files and then shows a matrix with all the translations (languages as headers, keys as rows). It is then quite evident what is missing. The sheet is sent to the translators, and when it comes back it can be processed to generate the same property bundles again. So far this has eased the pain somewhat, but I wonder what else is around.
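The matrix itself is a few lines to build once the property files are parsed; a sketch in Python, where bundles is an assumed language-to-{key: text} mapping:

```python
def translation_matrix(bundles):
    """Rows = keys, columns = languages; empty cells are missing translations."""
    langs = sorted(bundles)
    keys = sorted({key for entries in bundles.values() for key in entries})
    rows = [["key"] + langs]
    for key in keys:
        rows.append([key] + [bundles[lang].get(key, "") for lang in langs])
    return rows
```

The resulting rows can be written out as CSV for the translators, and the reverse direction (sheet back to property bundles) is just the same mapping transposed.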
This Google book on resource file management gives some good tips.
You can use Resource File Management software to keep track of strings that have changed and control the workflow to get them translated - otherwise you end up in a mess of freezes and overbearing version control
Some tools that do this sort of thing (no connection, and I haven't actually used them, just researching):
http://www.sisulizer.com/
http://www.translationzone.com/en/products/
I put in a makefile target that finds all the .properties files and puts them in a zip file to send off to the translators. I offered to send them just diffs, but for some reason they want the whole bundle of files each time. I think they have their own system for tracking differences, because they charge us based on how many strings have changed from one delivery to the next. When I get their delivery back, I manually diff all their files with the previous delivery to see if anything unexpected has changed. One time all the PT_BR (Brazilian Portuguese) strings changed, and it turned out they'd used a PT_PT (European Portuguese) translator for that batch in spite of the order for PT_BR.
In Java, internationalization is accomplished by moving the strings to resource bundles ... the translation process is still long and arduous, but at least it's separated from the process of producing the software, releasing service packs etc. One thing that helps is to have a CI system that repackages everything any time changes are made. We can have a new version tested and out in a matter of minutes whether it's a code change, new language pack or both.
For starters, I'd use default strings in case a translation is missing. For example, the English or Spanish value.
Secondly, you might want to consider a web app or something similar for your translators to use. This requires some resources upfront, but at least you won't need to send files around and it will be obvious for the translators which strings are new, etc.
