UK License Number RUTA entity extraction - text-analysis

I have been developing a proof of concept of free text analytics. The RUTA scripts which I have developed for account number, date, salutations, addresses, pin codes, name seem to work properly.
But I am stuck on one rule where I want to extract the license number in UK format from a textual paragraph. The rule I developed seems to work properly when it is alone passed as input but for some reason it fails in a text.
Any help would be highly appreciated as I have been with this issue for quite sometime.
PACKAGE uima.ruta.example;
DECLARE VarA;
DECLARE VarB;
DECLARE VarC;
W{REGEXP("^(?i)(a-z){2}") -> MARK(VarA)}
NUM{REGEXP("..") -> MARK(VarB)}
W{REGEXP("(?i)(a-z){3}$") -> MARK(VarC), MARK(EntityType,1,3), UNMARK(VarA), UNMARK(VarB), UNMARK(VarC)};
The format which I am expecting is
C - Character
N - Number
CCNNCCC
CCNN CCC

Your question(or problem) is not totally clear for me. Also the example script doesn't work (EntityType is not declared and the regular expressions are not valid).
I made an example script. Maybe that will help you:

Related

How to use Kettle to handle the linefeed in a field

I want to use Kettle to handle a data set.
The format of the data set is as below
product/productId: B00006HAXW
review/userId: A1RSDE90N6RSZF
review/profileName: Joseph M. Kotow
review/helpfulness: 9/9
review/score: 5.0
review/time: 1042502400
review/summary: Pittsburgh - Home of the OLDIES
review/text: I have all of the doo wop DVD's and this one is as good or better than the
1st ones. Remember once these performers are gone, we'll never get to see them again.
Rhino did an excellent job and if you like or love doo wop and Rock n Roll you'll LOVE
this DVD !!
I read every line of the data first and then transform every eight rows into a record.
However, there is such data:
review/profileName: nancy "crzyfnyfrog
I love my purple pigtails."
This field contains a linefeed and I don't know how to handle it.
Now I use the script code to achieve the function I want, but I still want to know how to use based components to solve it.

VBScript - RANDOMIZE (Cbyte(Left(Right(Time(),5),2))) throws 500 for ES locales

Using a standard VBScript Randomize statement (below) which works fine -- most of the time.
...RANDOMIZE (Cbyte(Left(Right(Time(),5),2)))
RANDOMIZE...
It took a bit, but in digging thru log files, I've noticed that it throws this 500 error:
Type mismatch: 'Cbyte'
when the users' languages are non-English.
I tried changing the Session.LCID (I'm using Classic ASP) in a test page but no effect.
Any suggestions for fixing or a work-around? Thank you...
It appears you're trying to randomise based on the seconds-within-the-minute value:
12:34:56 AM
|---|
56 AM (right(5))
||
56 (left(2))
Now I have no idea of the top of my head what Time() would return in a Spanish locale, but it may well be something like 12:34:56 de la mañana.
What I do know is that relying on a specific presentation format in a globalised world is a bad idea. In your case, it may involve trying to convert left(right("12:34:56 de la mañana",5),2), or "añ", into a numeric value, something it's not going to be happy with.
If you want a true root cause analysis, I'd suggest catching the conversion error and actually logging what Time() is presenting itself as when it errors.
If you just want to fix it, find a way to get the seconds that doesn't depend on locale, for example:
secs = Second(Time())
As an aside, I'm not sure why you think this is even needed. The documentation for the VBScript Randomise function states that, if an argument is not given, the value returned by the system timer is used as the new seed value. Hence it's already based on the current time.

Generate table of all main languages

I have to create a list with all languages which should look like this:
Col 1 | Col 2
-----------------
English | English
German | Deutsch
French | Français
Spanish | Español
...
Col 1: Language in English
Col 2: Original Country Language
The list should cover all main languages (or in other words: all languages which you can translate in google translator)
Of course this takes quite a while.
Is it possible to generate this list with a script by using the Google API ?
Yes, google has a Translation API which is very similar to the Google Translator. There are a couple of endpoints which are of interest to you in this case.
There is a way to list all the available languages, which would populate your Col 1. By default, this returns all the language (and sometimes language-country) codes that are supported, but you can provide a target query parameter to also include the name of the language in a "target language". In your case, you would want to show it "en-US".
In theory, you could repeat this for every language code and then just use the result for the language's own language code to populate Col 2. (This may be the most accurate way, but you'll get back a lot of extra data you don't want.)
Of course, you can also just translate the text to get your Col 2 results.

Generate code in code_128 format from another code

I have a list of the following codes and their corresponding codes in format code_128. I want to given a string, be able to generate the corresponging code in CODE_128 format. Based on this list, how could I generate a code_128 number to the string A4Y9387VY34, for example?
code code in code_128
A4Y9387VY34 ????
ADN38Y644YT7 9611019020018632869509
AXCW99QYTD34 9622019021500078083444
A9YQC44W9J3K 9611083009710754539701
AT8V7T3G3874 9622083021255845940154
A7K444N4FKB8 9622083033510467186874
AYCHFW448HTQ 9611005019246067403120
AY63CWBMTDCC 9622005028182439033426
ANY7TF46NGQ3 9622005031345848081170
AYY48TBVQ3FH 9611200003793988696055
AT8Q4CF4DQ9Q 9611200021606968867090
A764WYQFJWTT 9622200022706968919275
AC649ND7N8B6 9622148007265209832185
A4VDPTJ99YN4 9611148013412173923039
AHDYK498BD6T 9622148021309216149530
A4YYYNY7C3DJ 9611017021934363499071
AYG6XWVCCQ89 9622017031009914238743
A68YJHGQKCCM 9622017031138587166053
APMB7XG9XQC9 9611021011608391750002
AGP8C44Y8VYK 9622021021608111646113
A7C68B9T69XB 9622021021958603678086
AJYYWKR6BDGN 9611010022528724015883
AKMNVXDT9PYN 9622010027475034102229
AXPXMK9QMDFD 9622010031475028243694
I read a lot about it, but I didn't come to any solution. Thanks in advance!!
Well, this is a pretty open question, I will give you my suggestions:
If it is a finite list, you can use a Hash or a Dictionary, where
the keys are the Codes and map them to the corresponding value, in
your case, Code_128
Some scanners have software installed that allow you to change what
has been read to a new value, format it, etc.
If you need a bigger insight please, give us more detail about the environment you are using.
Hope that helps,
I decided to create a new answer because now I get your point. Well, if you are talking about a GS1-128 Code (please see www.gs1.org) please do not start without visiting Wikipedia info about it. as you can see, there is a thorough explanation about how to work with that type of code. That code is composed by several application identifiers followed by their corresponding values. There is a better way of encoding them by using special characters as parenthesis. Here is other info that may help you.
Hope it helps,

iphone's phone number splitting algorithm?

iPhone has a pretty good telephone number splitting function, for example:
Singapore mobile: +65 9852 4135
Singapore resident line: +65 6325 6524
China mobile: +86 135-6952-3685
China resident line: +86 10-65236528
HongKong: +886 956-238-82
USA: +1 (732) 865-3286
Notice the nice features here:
- the splitting of country code, area code, and the rest is automatic;
- the delimiter is also nicely adopted to different countries, e.g. "()", "-" and space.
Note the parsing logic is doable to me, however, I don't know where to get the knowledge of most countries' telephone number format.
where could i found such knowledge, or an open source code that implemented it?
You can get similar functionality with the libphonenumber code library.
Interestingly enough, you cannot use an NSNumberFormatter for this, but you can write your own custom class for it. Just create a new class, set properties such as countryCode, areaCode and number, and then create a method that formats the number based on the countryCode.
Here's a great example: http://the-lost-beauty.blogspot.com/2010/01/locale-sensitive-phone-number.html
As an aside: a friend told me about a gigantic regular expression he had to maintain that could pick telephone numbers out of intercepted communications from hundreds of countries around the world. It was very non-trivial.
Thankfully your problem is easier, as you can just have a table with the per-country formats:
format[usa] = "+d (ddd) ddd-dddd";
format[hk] = "+ddd ddd-ddd-dd";
format[china_mobile] = "+dd ddd-dddd-dddd";
...
Then when you're printing, you simply output one digit from the phone number string in each d spot as needed. This assumes you know the country, which is a safe enough assumption for telephone devices -- pick "default" formats for the few surrounding countries.
Since some countries have different formats with different lengths you might need to store your table with additional information:
format[germany][10] = "..."
format[germany][11] = "....."

Resources