Use machine learning to validate phone numbers - validation

So I am looking to validate user input phone numbers.
So far I have been doing so with Regex. But with different phone number formats from all around the world it's been getting hard to maintain the Regex.
Since I have a lot of datasets of valid phone numbers I figured it might be possible to use a machine learning algorithm.
Because I don't have any prior experience with machine learning, I tried to prototype it using a scikit-learn SVM. It didn't work.
Now I'm curious whether this is even a good use case for a machine learning algorithm. If it is, what are some resources I should look up?
If not, what are some alternatives to machine learning for creating an easy-to-extend phone number validation?

This is a plain programming problem: you probably need to refactor your code into some kind of class that's responsible for validating phone numbers from different countries.
Also, from a regex perspective, the question of updating it for international phone numbers has been asked here: What regular expression will match valid international phone numbers? The best answer there is to use the following regex:
\+(9[976]\d|8[987530]\d|6[987]\d|5[90]\d|42\d|3[875]\d|
2[98654321]\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|
4[987654310]|3[9643210]|2[70]|7|1)\d{1,14}$
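As a rough illustration of how the expression above might be applied, here is a minimal Python sketch. The function name `looks_like_e164` and the decision to strip spaces, hyphens, dots and parentheses before matching are assumptions made for this example, not part of the linked answer. It only checks that the input starts with a plausible country code followed by a plausible number of digits; it says nothing about whether the number is actually assigned.

```python
import re

# The pattern from the answer above, split over several lines for readability.
E164_PATTERN = re.compile(
    r"\+(9[976]\d|8[987530]\d|6[987]\d|5[90]\d|42\d|3[875]\d|"
    r"2[98654321]\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|"
    r"4[987654310]|3[9643210]|2[70]|7|1)\d{1,14}$"
)


def looks_like_e164(raw: str) -> bool:
    """Return True if the input, once separators are removed, matches the pattern."""
    # Assumption: spaces, hyphens, dots and parentheses are cosmetic only.
    compact = re.sub(r"[\s\-().]", "", raw)
    return E164_PATTERN.match(compact) is not None


print(looks_like_e164("+44 20 7946 0999"))  # True
print(looks_like_e164("020 7946 0999"))     # False, no leading + and country code
```

Keeping the normalization step separate from the pattern means the expression itself never has to grow to cover formatting variants.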
Regarding machine learning, here's a nice summary of what questions machine learning can answer, which can be summarized in the following list:
Is this A or B?
Is this weird?
How much/how many?
How is it organized?
What should I do next?
Check the blog article (there is also a video within the article) for more details. Your question doesn't really fit in any of the above five categories.

International phone number rules are immensely complicated so it's unlikely a regex will work. Training a machine learning algorithm could potentially work if you have enough data, but there are some weird edge cases and formatting variables (including multiple ways of expressing the same phone number) that would make life difficult.
A better option is to use Google's libphonenumber. It's an open source phone number validation library implemented in C++ and Java, with ports for quite a few other languages.

The given task is syntax-restricted and subject to regulatory procedures.
Machine learning would need a training dataset large enough (a super-set) to meet the projected error rate, as constrained by Hoeffding's inequality. For low error-rate targets, such a dataset is in principle (almost) impossible to assemble.
Even regex tools are (almost) guessing, as the terminal parts of the E.164 "address" are (almost) unmaintainable for the global address space.
Probabilistic ML learners may make some sense here, but again, they will knowingly guess (with the comfort of providing a working estimate of the confidence level achieved by each guess).
Why?
Because each telephone number (setting aside lexical irregularities and similar cosmetic details) must conform to a global set of regulations (ITU-T governed), then, on a lower level, to a national set of regulations (multi-party governed). Finally, there are two distinct assignment procedures for an E.164 "address", which does not make the story any easier.
RFC 4725, a brief view:
just to illustrate the [ ITU-T [, NNPA [, CSP [, <privateAdmin> ]]]] hierarchy of distributed rules (absolute syntax, distributed governance) that enters into any analysis of E.164 number blocks, down to an individual number.
RFC 4725 ENUM Validation Architecture November 2006
These two variants of E.164 number assignment are depicted in
Figure 2:
+--------------------------------------------+
| International Telecommunication Union (ITU)|
+--------------------------------------------+
     |
  Country codes (e.g., +44)
     |
     v
+-------------------------------------------+
| National Number Plan Administrator (NNPA) |------------+
+-------------------------------------------+            |
     |                                                   |
  Number Ranges                                          |
  (e.g., +44 20 7946 xxxx)                               |
     |                                                   |
     v                                                   |
+--------------------------------------+                 |
| Communication Service Provider (CSP) |                 |
+--------------------------------------+                 |
     |                                                   |
     |                                            Single Numbers
  Either Single Numbers                       (e.g., +44 909 8790879)
  or Number Blocks                                 (Variant 2)
  (e.g., +44 20 7946 0999, +44 20 7946 07xx)             |
  (Variant 1)                                            |
     |                                                   |
     v                                                   |
+----------+                                             |
| Assignee |<--------------------------------------------+
+----------+
Figure 2: E.164 Number Assignment
(Note: Numbers above are "drama" numbers and are shown for
illustrative purpose only. Assignment polices for similar "real"
numbers in country code +44 may differ.)
As the Assignee (subscriber) data associated with an E.164 number is
the primary source of number assignment information, the NAE usually
holds the authoritative information required to confirm the
assignment.
A CSP that acts as NAE (indirect assignment) may therefore easily
assert the E.164 number assignment for its subscribers. In some
cases, such CSPs operate database(s) containing service information
on their subscribers' numbers.

Related

Examples of Custom Control Flow Compiling Words

Forth famously allows users to alter the language by defining new words for control flow (beyond those given by the standard: DO, LOOP, BEGIN, UNTIL, WHILE, REPEAT, LEAVE, IF, THEN, ELSE, CASE, ENDCASE, etc.)
Are there common examples of people actually creating their own new control flow words? What are some typical and useful examples? Or has the standard already defined everything that people actually need?
I'm hoping to find examples of useful language extensions that have gained acceptance or proved generally helpful to make the language more expressive.
Another major direction for control flow structures in Forth is backtracking. It is a very expressive and powerful mechanism. Implementing it requires return address manipulation [Gas99].
Backtracking in Forth was developed as the BacFORTH extension by M.L. Gassanenko in ~1988-1990. The first papers on this topic were in Russian.
The technique of backtracking enables one to create abstract iterator
and filter modules responsible for looking over sets of all possible
values and rejecting "undue" ones [Gas96b].
For an introduction, see the short description: Backtracking (by mlg). The multi-threading in Forth? discussion in comp.lang.forth can also be useful (see the messages from Gassanenko).
Just one example of generator in BacFORTH:
: (0-2)=> PRO 3 0 DO I CONT LOOP ; \ generator
: test (0-2)=> CR . ." : " (0-2)=> . ;
test CR
Output:
0 : 0 1 2
1 : 0 1 2
2 : 0 1 2
PRO and CONT are special control flow words. PRO marks a word as a generator, and CONT calls the consumer; it is something like yield in Ruby or ECMAScript. A number of other special words are also defined in BacFORTH.
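For readers more familiar with generators than with Forth, the behaviour of the example can be approximated in Python. This is only an analogy, under the assumption that CONT behaves like `yield`; BacFORTH achieves the effect through return address manipulation rather than a generator object, and the names `zero_to_two` and `run_test` are made up for this sketch.

```python
def zero_to_two():
    """Rough counterpart of (0-2)=> : hand 0, 1 and 2 to the consumer in turn."""
    for i in range(3):
        yield i  # plays the role of CONT


def run_test():
    """Rough counterpart of the word test: two nested uses of the generator."""
    lines = []
    for outer in zero_to_two():
        inner = " ".join(str(j) for j in zero_to_two())
        lines.append(f"{outer} : {inner}")
    return lines


for line in run_test():
    print(line)
```

The nesting is explicit here, whereas in the Forth version the rest of the definition after each `(0-2)=>` acts as the loop body.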
You can play with BacFORTH in SP-Forth (just include ~profit/lib/bac4th.f library).
Etymology
In general, backtracking is just an algorithm for finding solutions. In Prolog this algorithm is embedded under the hood, so backtracking in Prolog is how the language itself works. Backtracking in BacFORTH is a programming technique that is supported by a set of special control flow words.
References
[Gas96a] M.L. Gassanenko, Formalization of Backtracking in Forth, 1996 (mirror)
[Gas96b] M.L. Gassanenko, Enhancing the Capabilities of Backtracking, 1996 (mirror)
[Gas99] M.L. Gassanenko, The Open Interpreter Word Set, 1999
Here's one example. CASE was a somewhat late addition to the set of Forth control flow words. In early 1980, a competition for defining the best CASE statement was announced in Forth Dimensions. It was settled later that year with a tie between three entries. One of those ended up in the Forth94 standard.

Why doesn't scheme have primitive c data types like int, float etc

And how does it allocate memory from the memory pool? How many bytes does it use for symbols and numbers, and how does it handle type-casting, since it doesn't have int and float types for conversions?
I really tried researching this on the internet. I'm sorry I have to ask here, but I found nothing.
Like other dynamically typed languages, Scheme does have types, but they're associated with values instead of with variables. This means you can assign a boolean to a variable at one point and a number at another point in time.
Scheme doesn't use C types, because a Scheme implementation isn't necessarily tied to C at all: several compilers emit native code, without going through C. And like the other answers mention, Scheme (and Lisp before it) tries to free the programmer from having to deal with such (usually) unimportant details as the target machine's register size.
Numeric types specifically are pretty sophisticated in Lisp variants. Scheme has the so-called numeric tower that abstracts away details of representation. Much like many "newer" languages such as Go, Python, and Ruby, Scheme will represent small integers (called "fixnums") in a machine register or word in memory. This means it'll be fast like in C, but it will automatically switch to a different representation once the integer exceeds that size, so that arbitrarily large numbers can be represented without needing any special provisioning.
The other answers have already shown you the implementation details of some Schemes. I've recently blogged about CHICKEN Scheme's internal data representation. The post contains links to data representation of several other Schemes, and at the end you'll find further references to data representation in Python, Ruby, Perl and older Lisp variants.
The beauty of Lisp and Scheme is that these are such old languages, but they still contain "new ideas" that only now get added to other languages. Garbage collection pretty much had to be invented for Lisp to work, it supported a numeric tower for a long time, object orientation was added to it at a pretty early date, anonymous procedures were in there from the beginning I think, and closures were introduced by Scheme when its authors proved that lambda can be implemented as efficiently as goto.
All of this was invented between the 1950s and the 1980s. Meanwhile, it took a long long time before even garbage collection became accepted in the mainstream (basically with Java, so about 45 years), and general support for closures/anonymous procedures has become popular only in the last 5 years or so. Even tail call optimization isn't implemented in most languages; JavaScript programmers are only now discovering it. And how many "modern" languages still require the programmer to handle arbitrarily large integers using a separate set of operators and as a special type?
Note that a lot of these ideas (including the numeric type conversion you asked about) introduce additional overhead, but the overhead can be reduced by clever implementation techniques. And in the end most are a net win because they can improve programmer productivity. And if you need C or assembly performance in selected parts of your code, most implementations allow you to drop down to the metal through various tricks, so this is not closed off to you. The disadvantage would be that it isn't standardized (though there is cffi for Common Lisp), but like I said, Scheme isn't tied to C so it would be very rude if the spec enforced a C foreign function interface onto non-C implementations.
The answer to this question is implementation dependent.
Here is how it was done in the Scheme compiler workshop.
The compiler generated machine code for a 32-bit Sparc machine.
See http://www.cs.indiana.edu/eip/compile/back.html
Data Formats
All of our data are represented by 32-bit words, with the lower three bits as a kind of type-tag. While this would normally only allow us eight types, we cheat a little bit: Booleans, empty-lists and characters can be represented in (much) less than 32 bits, so we steal a few of their data bits for an "extended" type tag.
Numbers:
--------------------------------------
| 29-bit 2's complement integer 000 |
--------------------------------------
Booleans:
    -------------------      -------------------
#t: | ... 1 00000 001 | #f: | ... 0 00000 001 |
    -------------------      -------------------
Empty lists:
-----------------
| ... 00001 001 |
-----------------
Characters:
---------------------------------------
| ... 8-bit character data 00010 001 |
---------------------------------------
Pairs, strings, symbols, vectors and closures maintain a 3-bit type tag, but devote the rest of their 32 bits to an address into the heap where the actual value is stored:
Pairs:
---------------     -------------
| address 010 | --> | car | cdr |
-----\---------  /  -------------
      -----------
Strings:
---------------     -------------------------------------------------
| address 011 | --> | length | string data (may span many words)... |
-----\---------  /  -------------------------------------------------
      -----------
Symbols:
---------------     --------------------------
| address 100 | --> | symbol name (a string) |
-----\---------  /  --------------------------
      -----------
Vectors:
---------------
| address 101 |
-----|---------
     v
-----------------------------------------------------------
| length | (v-ref 0) | (v-ref 1) | ... | (v-ref length-1) |
-----------------------------------------------------------
Closures:
---------------
| address 110 |
-----|---------
     v
-----------------------------------------------------------------------
| length | code pointer | (free 0) | (free 1) | ... | (free length-1) |
-----------------------------------------------------------------------
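To make the tagging scheme in the diagrams concrete, here is a small Python sketch of the immediate (non-heap) formats only. All names (`encode_fixnum`, `encode_immediate`, `TRUE`, and so on) are hypothetical and chosen for this illustration; the bit positions are read off the diagrams above, with the assumption that the extended tag occupies bits 3 to 7 and the data starts at bit 8.

```python
WORD_MASK = 0xFFFFFFFF   # every value is one 32-bit word
TAG_BITS = 3             # the low three bits hold the primary type tag
FIXNUM_TAG = 0b000
IMMEDIATE_TAG = 0b001    # booleans, the empty list and characters share this tag
EXT_BOOLEAN = 0b00000    # five-bit extended tags sit just above the primary tag
EXT_EMPTY_LIST = 0b00001
EXT_CHARACTER = 0b00010


def encode_fixnum(n: int) -> int:
    """Pack a 29-bit two's-complement integer above the 000 tag."""
    if not -(1 << 28) <= n < (1 << 28):
        raise OverflowError("value does not fit in 29 bits")
    return ((n << TAG_BITS) | FIXNUM_TAG) & WORD_MASK


def decode_fixnum(word: int) -> int:
    """Recover the signed integer from a fixnum word."""
    value = word >> TAG_BITS
    if value & (1 << 28):          # sign bit of the 29-bit field
        value -= 1 << 29
    return value


def encode_immediate(ext_tag: int, data: int = 0) -> int:
    """Pack data bits, the extended tag and the 001 primary tag into one word."""
    return ((data << 8) | (ext_tag << TAG_BITS) | IMMEDIATE_TAG) & WORD_MASK


TRUE = encode_immediate(EXT_BOOLEAN, 1)
FALSE = encode_immediate(EXT_BOOLEAN, 0)
EMPTY_LIST = encode_immediate(EXT_EMPTY_LIST)


def encode_char(ch: str) -> int:
    """Pack an 8-bit character code above the character extended tag."""
    return encode_immediate(EXT_CHARACTER, ord(ch) & 0xFF)


print(hex(encode_fixnum(5)), hex(TRUE), hex(FALSE), hex(EMPTY_LIST), hex(encode_char("A")))
```

One consequence of giving fixnums the all-zero tag is that two tagged fixnums can be added directly: `encode_fixnum(2) + encode_fixnum(3)` already equals `encode_fixnum(5)`, with no untagging step.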
The short answer is that it has primitive data types, but you as a programmer don't need to worry about it.
The designer of Lisp was from a math background and didn't use the limitations of a specific platform as inspiration. In math a number isn't 32 bits, but we do differentiate between exact numbers and inexact ones.
Scheme was originally interpreted in MacLisp and inherited the types and primitives of MacLisp. MacLisp is based on Lisp 1.5.
A variable doesn't have a type, and most implementations use a machine pointer as its data type. Primitives like chars, symbols and small integers are stored right in the address by using the least significant bits as a type flag. Those bits would always be zero for an actual object, since the machine aligns objects in memory to register width.
If you add two integers and the result becomes bigger than the fixnum size, the result is of a different type. In C it would overflow.
;; This is Common Lisp, but the same happens in Scheme
(type-of 1) ; ==> BIT
(type-of 10) ; ==> (INTEGER 0 281474976710655)
(type-of 10000000000000000) ; ==> (INTEGER (281474976710655))
The types of the objects are different even though we treat them the same. The first two don't use any space beyond the pointer itself, but the last is a pointer to an actual object that is allocated on the heap.
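The same contrast can be sketched in Python, whose integers also switch representation silently. The helper `add_wrapping_int32` is a hypothetical stand-in that mimics 32-bit two's-complement wraparound; it is an assumption about what a fixed-width machine add does, not a model of any particular C compiler.

```python
INT32_MAX = 2**31 - 1


def add_wrapping_int32(a: int, b: int) -> int:
    """Add two integers and wrap the result into the signed 32-bit range."""
    return (a + b + 2**31) % 2**32 - 2**31


def add_promoting(a: int, b: int) -> int:
    """Add two integers the way Scheme or Python does: the result just gets bigger."""
    return a + b


print(add_wrapping_int32(INT32_MAX, 1))  # -2147483648, the value wrapped around
print(add_promoting(INT32_MAX, 1))       # 2147483648, no special handling needed
```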
All of this is implementation dependent. The Scheme standard does not dictate how it's done, but many implementations do it just like this. You can read the standard and it says nothing about how to model numbers, only the behavior. You could make an R6RS Scheme that stores everything in byte arrays.

How can MARS produce weird constants in terms?

I've been reading about an interesting machine learning algorithm, MARS (multivariate adaptive regression splines).
As far as I understand the algorithm, from Wikipedia and Friedman's papers, it works in two stages: a forward pass and a backward pass. I'll ignore the backward pass for now, since the forward pass is the part I'm interested in. The steps for the forward pass, as far as I can tell, are:
1. Start with just the mean of the data.
2. Generate a new term pair through exhaustive search.
3. Repeat step 2 while improvements are being made.
And to generate a term pair MARS appears to do the following:
Select an existing term (e)
Select a variable (x)
Select a value of that variable (v)
Return two terms one of the form e*max(0,x-v) and the other of the form e*max(0, v-x)
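The term-pair step as described in the list above can be sketched in Python as follows. This reflects only the question's own reading of the algorithm; the function name `hinge_pair`, the representation of terms as functions of a row, and the choice of knot are assumptions made for illustration. The rows are the ones from the small A/B/Z table used in this question.

```python
def hinge_pair(existing_term, column, knot):
    """Build the two mirrored hinge terms e*max(0, x-v) and e*max(0, v-x)."""

    def positive(row):
        return existing_term(row) * max(0.0, row[column] - knot)

    def negative(row):
        return existing_term(row) * max(0.0, knot - row[column])

    return positive, negative


def constant(row):
    """The starting term: a constant that does not depend on the row."""
    return 1.0


rows = [
    {"A": 5, "B": 6, "Z": 1},
    {"A": 7, "B": 2, "Z": 2},
    {"A": 3, "B": 1, "Z": 3},
]

above, below = hinge_pair(constant, "B", 2)
print([above(r) for r in rows])  # [4.0, 0.0, 0.0]
print([below(r) for r in rows])  # [0.0, 0.0, 1.0]
```

Passing one of the returned functions back in as `existing_term` produces the product terms mentioned in the question.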
And this makes sense to me. I could see how, for example, a data table like this:
+---+---+---+
| A | B | Z |
+---+---+---+
| 5 | 6 | 1 |
| 7 | 2 | 2 |
| 3 | 1 | 3 |
+---+---+---+
could produce terms like 2*max(0, B-1) or even 8*max(0, B-1)*max(0, 3-A). However, the Wikipedia page has an example that I don't understand. It has an ozone example where the first term is 25. However, it also has a term in the final regression whose coefficient is negative and fractional. I don't see how you could ever end up with one, since the initial term is 5, you can only multiply by previous terms, and no previous term can have a negative coefficient.
What am I missing?
As I see it, either I misunderstand term generation, or I misunderstand the simplification process. However, simplification as described seems to only delete terms, not modify them. Can you see what I am missing here?

What algorithms can I use to produce simple human-readable fault-tolerant strings?

Humans make mistakes when you require them to provide a unique generated ID identifying some entity.
For example:
Order A: has id ABC1234
Order B: has id BCD1235
They can make typos, or they can provide strings such as: A123, B123, 1 2 3, "Order id B 12/3"
It is then a challenge for an automatic system to identify the original ID.
My question is: are there any known algorithms or techniques to generate an ID that is
-unique and human readable (not SHA or MD5)
-fault tolerant, so that you can still decode the original ID from a subset of its characters
-case insensitive
A visual example of fault tolerance is the QR code: when some part of a QR code is damaged, you can still read the message.
The goal is to avoid tools and algorithms such as Elasticsearch or Levenshtein distance, to increase the chance of decoding the original ID even when the customer makes a typo, and to reduce the chance that some other "original ID" will be provided instead.
Aside from error correction, the interesting part of this question is whether there are codes designed specifically for humans to read and transcribe.
In RFC 3548, some considerations are made for avoiding the use of easily-confused characters in base32 coding (1 and L, 0 and o). Human-oriented base-32 encoding has some variations on that concept.
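As a sketch of how the confusable-character idea might be combined with a simple check symbol, consider the Python below. Everything in it is an assumption for illustration: the 32-symbol alphabet (digits plus letters without I, L, O and U), the mapping of O to 0 and of I and L to 1, and the plain sum-modulo-32 check symbol. This check detects any single substituted character but does not correct it, and it does not catch two swapped characters.

```python
import re

# Hypothetical alphabet: 32 symbols, leaving out the letters I, L, O and U.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
# Fold easily confused letters onto the digits they resemble.
CONFUSABLE = str.maketrans({"O": "0", "I": "1", "L": "1"})


def normalize(raw: str) -> str:
    """Drop separators, ignore case and fold confusable characters."""
    cleaned = re.sub(r"[^0-9A-Za-z]", "", raw).upper()
    return cleaned.translate(CONFUSABLE)


def check_symbol(body: str) -> str:
    """Sum of symbol values modulo 32, expressed as a symbol of the alphabet."""
    return ALPHABET[sum(ALPHABET.index(ch) for ch in body) % 32]


def make_id(body: str) -> str:
    """Append the check symbol to a normalized ID body."""
    body = normalize(body)
    return body + check_symbol(body)


def is_valid(raw: str) -> bool:
    """Accept input whose last symbol matches the check symbol of the rest."""
    candidate = normalize(raw)
    if len(candidate) < 2 or any(ch not in ALPHABET for ch in candidate):
        return False
    return check_symbol(candidate[:-1]) == candidate[-1]


print(make_id("ABC1234"))      # ABC1234B
print(is_valid("abc-l234 b"))  # True, case, separators and l-for-1 are tolerated
print(is_valid("ABC1235B"))    # False, one wrong digit is detected
```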
For audio, the PGP Word List is designed to give each byte a distinct word; it helps to protect against errors by having two lists of 256 words, one used for even bytes, the other for odd bytes (so a missing byte or swapped bytes can be detected).
There was a discussion here on SO about human friendly, pronounceable IDs which might be interesting. Work on pronounceable passwords (like Diceware) is somewhat related.
Metafilter also had a discussion about codes that are easy for humans to copy that provides a few more interesting references.

The best String reconstruction algorithm out there ? (Best as in 'most accurate')

I've been searching for and testing all kinds of string reconstruction algorithms, i.e. algorithms for reconstructing spaceless text into normal text.
My result, posted here as Solution working partially in Ruby, achieves 90% reconstruction for sentences of 2 or 3 words, with a complete dictionary. But I can't get it to do better than this!
I think my algorithm, which is inspired by dynamic programming, is bad and contains a lot of patchwork.
Can you propose another algorithm (in pseudo-code) that would work foolproof with a complete dictionary?
You need more than just a dictionary, because you can have multiple possible phrases from the same spaceless string. For example, "themessobig" could be "the mess so big" or "themes so big" or "the mes so big", etc.
Those are all valid possibilities, but some are far more likely than others. Thus what you want to do is pick the most likely one given how the language is actually used. For this you need a huge corpus of text along with some NLP algorithms. Probably the simplest one is to count how likely a word is to occur after another word. So for "the mess so big", its likelihood would be:
P(the | <START>) * P(mess | the) * P(so | mess) * P(big | so)
For "themes so big", the likelihood would be:
P(themes | <START>) * P(so | themes) * P(big | so)
Then you can pick the most likely of the possibilities. You can also condition on triplets instead of pairs (e.g. P(so | the + mess)), which will require a bigger corpus to be effective.
This won't be foolproof, but you can get better and better at it by using better corpora or tweaking the algorithm.
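The chained product above can be sketched in a few lines of Python. The function name `sequence_likelihood`, the `floor` value used for unseen word pairs, and the toy probabilities are all assumptions made for illustration; real values would be estimated from a corpus.

```python
from math import prod


def sequence_likelihood(words, bigram_prob, floor=1e-9):
    """Multiply P(word | previous word) along a candidate word sequence."""
    previous = ["<START>"] + words[:-1]
    return prod(
        bigram_prob.get((prev, word), floor)  # unseen pairs get a small floor value
        for prev, word in zip(previous, words)
    )


# Made-up conditional probabilities, keyed by (previous word, word).
toy_bigrams = {
    ("<START>", "is"): 0.1, ("is", "it"): 0.2, ("it", "late"): 0.05,
    ("<START>", "i"): 0.1, ("i", "sit"): 0.001, ("sit", "late"): 0.01,
}

candidates = [["is", "it", "late"], ["i", "sit", "late"]]
best = max(candidates, key=lambda c: sequence_likelihood(c, toy_bigrams))
print(" ".join(best))  # is it late
```

For long inputs the product of many small probabilities underflows, so summing logarithms instead is a common design choice.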
With a unigram language model, which is essentially word frequencies,
it is possible to find the most probable segmentation of a string.
Example code from Russell & Norvig (2003, p. 837) (look for the function viterbi_segment)
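For orientation, here is a minimal Python sketch of unigram segmentation by dynamic programming. It is written in the spirit of the approach mentioned above but is not the book's code; the function name `segment` and the toy word probabilities are assumptions made for this example.

```python
import math


def segment(text, word_prob):
    """Return the most probable split of text into known words, or None."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word in text[:i]
    max_len = max(map(len, word_prob))
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob and best[start] > -math.inf:
                score = best[start] + math.log(word_prob[word])
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        return None                  # no segmentation uses only known words
    words, end = [], n
    while end > 0:                   # walk the back-pointers from the end
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Made-up unigram probabilities; a real model would use corpus word frequencies.
toy_unigrams = {"i": 0.01, "is": 0.01, "it": 0.01, "sit": 0.0001, "late": 0.0005}
print(segment("isitlate", toy_unigrams))  # ['is', 'it', 'late']
```

With these toy numbers the competing split "i sit late" is also valid but scores lower, which is exactly the tie-breaking a dictionary alone cannot provide.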

Resources