Formatting large numbers using numeral.js - format

In my app, I want to format various numbers using a library, and I have a couple of related questions (which I am not submitting separately because I think they represent a very common set of problems):
1. Format a number using a format string constant to achieve compressed literals such as 1.2k or 1.23M.
2. Format a number using a format string constant to have a thousands delimiter applied, regardless of the client's locale settings.
I tried to achieve a formatting result where the language's thousands delimiter is actually taken into consideration:
http://jsfiddle.net/erbronni/19mLmekt/
// load a language
numeral.language('fr', {
    delimiters: {
        thousands: ' ',
        decimal: ','
    },
    abbreviations: {
        thousand: 'k',
        million: 'M',
        billion: '',
        trillion: 't'
    },
    ordinal : function (number) {
        return number === 1 ? 'er' : 'ème';
    },
    currency: {
        symbol: '€'
    }
});
numeral.language('fr');
document.getElementById('f1').innerHTML = numeral(12345678).format('0 000') // intended output: '12 345 678' -- does not seem to work

Numeral.js has this built in. It can be easily achieved using a format string such as .format('0.00a').
Some full examples:
numeral(1000000).format('0a') will return 1m
numeral(250500).format('0.0a') will return 250.5k
numeral(10500).format('0.00a') will return 10.50k
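To make the two formatting goals from the question concrete outside of numeral.js, here is a minimal Python sketch of the same ideas. The function names `abbreviate` and `group_thousands` are made up for this illustration and are not part of any library, and the rounding behaviour may differ from what numeral.js does near the thresholds.

```python
def abbreviate(value, decimals=0):
    """Format a number with a k/m/b/t suffix, e.g. 250500 -> '250.5k'."""
    # Largest threshold first, so the first hit is the right suffix
    suffixes = [(1e12, "t"), (1e9, "b"), (1e6, "m"), (1e3, "k")]
    for threshold, suffix in suffixes:
        if abs(value) >= threshold:
            return f"{value / threshold:.{decimals}f}{suffix}"
    return f"{value:.{decimals}f}"


def group_thousands(value, delimiter=" "):
    """Group digits in threes with a fixed delimiter, ignoring the system locale."""
    return f"{value:,}".replace(",", delimiter)


print(abbreviate(1000000))        # 1m
print(abbreviate(250500, 1))      # 250.5k
print(abbreviate(10500, 2))       # 10.50k
print(group_thousands(12345678))  # 12 345 678
```

Because the delimiter is passed in explicitly, the output does not depend on the locale of the machine running the code.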

Ruby: Why does unpack('Q') give a different result than manual conversion?

I'm trying to write a function that will .unpack('Q') (unpack to uint64_t) without access to the unpack method.
When I manually convert from string to binary to uint64, I get a different result than .unpack('Q'):
Integer('abcdefgh'.unpack('B*').first, 2) # => 7017280452245743464
'abcdefgh'.unpack('Q').first # => 7523094288207667809
I don't understand what's happening here.
I also don't understand why the output of .unpack('Q') is fixed regardless of the size of the input. If I add a thousand characters after 'abcdefgh' and then unpack('Q') it, I still just get [7523094288207667809]?
Byte order matters:
Integer('abcdefgh'.
each_char.
flat_map { |c| c.unpack('B*') }.
reverse.
join, 2)
#⇒ 7523094288207667809
'abcdefgh'.unpack('Q*').first
#⇒ 7523094288207667809
Your code produces a different result because the 'Q' directive reads the bytes in native byte order, which is little-endian on most machines, so the bytes must be reversed after converting them to binary.
For the last part of your question, the output of .unpack('Q') doesn't change with a longer input string because the format specifies a single 64-bit value, so any characters after the first 8 are ignored. If you specified a format of Q2 and a 16-character string, you'd decode 2 values:
> 'abcdefghihjklmno'.unpack('Q2')
=> [7523094288207667809, 8029475498074204265]
and again you'd find adding additional characters wouldn't change the result:
> 'abcdefghihjklmnofoofoo'.unpack('Q2')
=> [7523094288207667809, 8029475498074204265]
A format of Q* would return as many values as multiples of 64-bits were in the input:
> 'abcdefghihjklmnopqrstuvw'.unpack('Q*')
=> [7523094288207667809, 8029475498074204265, 8608196880778817904]
> 'abcdefghihjklmnopqrstuvwxyz'.unpack('Q*')
=> [7523094288207667809, 8029475498074204265, 8608196880778817904]
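The byte-order difference is not specific to Ruby. As a cross-check, this Python sketch reads the same eight bytes both ways; it assumes the Ruby results above came from a little-endian machine, which is why the explicit little-endian format `<Q` is used here.

```python
import struct

data = b"abcdefgh"

# Most significant byte first: same as joining the bits left to right
big_endian = int.from_bytes(data, "big")
# Least significant byte first: what a little-endian machine sees
little_endian = int.from_bytes(data, "little")

# "<Q" reads exactly one unsigned 64-bit little-endian integer
(unpacked,) = struct.unpack("<Q", data)

# unpack_from reads only the first 8 bytes and ignores whatever follows
(from_longer,) = struct.unpack_from("<Q", data + b"x" * 1000)

print(big_endian)     # 7017280452245743464
print(little_endian)  # 7523094288207667809
print(unpacked == little_endian == from_longer)  # True
```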

Extracting information from a scanned GS1-type barcode

I want to determine product information such as the description, manufacturer and expiry date from a scanned GS1 barcode message.
How can I do that?
There are two processes involved in obtaining the information represented by a GS1-type barcode that stores data in GS1 Application Identifier Standard Format:
1. Extraction of the data fields (referred to as Application Identifiers) contained within the GS1-structured data obtained by scanning the symbol. This always includes a unique identifier for the item called a GTIN-14 and may include supplementary information such as an expiry date, LOT number, etc. This process can be performed by a standalone application.
2. Lookup of the extracted GTIN in a database, either local to your application or via some public API, to provide a textual representation of the country of origin, manufacturer and possibly the item description. To perform this process comprehensively an application requires access to external resources.
Background: GS1 Application Identifier Standard Format Composition
GS1-formatted data consists of a concatenated list of Application Identifiers (AIs) and values, beginning with AI (01) which represents the GTIN.
For example, the data "(01) 95012345678903 (10) 000123 (17) 150801" represents the following information:
GTIN: 95012345678903
BATCH/LOT: 000123
USE BY OR EXPIRY: 1st August 2015
Section 3 of the GS1 General Specifications, "GS1 Application Identifier Definitions", provides the meaning of each Application Identifier and, importantly, states whether its value is variable-length or fixed-length; for fixed-length values it gives the mandatory length.
GS1 barcodes use a special non-data character (FNC1) both to indicate that the data conforms to GS1 Application Identifier standard format and to separate a variable-length data field from the next AI. For example, the above data could be encoded in a Code 128 symbol as {FNC1}019501234567890310000123{FNC1}17150801 to produce a GS1-128 symbol.
When this symbol is read by a barcode scanner it is decoded as follows[†]:
019501234567890310000123{GS}17150801
Note that the initial FNC1 non-data character has been discarded and the FNC1 used in the variable-length AI separator role has been represented by a GS character (ASCII value 29).
Extraction (and optionally validation)
Extraction of the GTIN and any supplementary information can be performed directly by your application.
To recover the original Application Identifier data from the decoded scan data, your application needs a data structure, which we shall refer to as AI-TABLE, that maps AI patterns to the length of their values. It is derived from the section of the GS1 General Specifications referred to above:
AI | N (value length)
-------------------------
(00) | 18
(01) | 14
(10) | variable
(17) | 6
(240) | variable
(310n) | 6
(37) | variable
...
With this available you can proceed with AI-value extraction from the scanned barcode data as follows:
while more data:
    AI,N = Entry from AI-TABLE matching a prefix of the data, otherwise FAIL.
    if N is fixed-length:
        VALUE = next N characters
    else N is variable length:
        VALUE = characters until "GS" or end of data
    emit: (AI) VALUE
In practice you may choose to include more of the data from the General Specifications in your AI-TABLE to permit your application to perform enhanced validation of each VALUE's type and length. However, the above is sufficient to extract the given data, such as AI (17) representing the expiry date which you are looking for.
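The extraction loop above can be sketched in Python as follows. `AI_TABLE` is only a hypothetical excerpt covering the AIs listed earlier, the function name `extract_ais` is made up for this example, and a real application would load the full table from the General Specifications.

```python
GS = "\x1d"  # ASCII 29, emitted by the scanner for a separator FNC1

# Hypothetical excerpt of AI-TABLE:
# AI prefix -> (total AI length, value length or None if variable)
AI_TABLE = {
    "00": (2, 18),
    "01": (2, 14),
    "10": (2, None),
    "17": (2, 6),
    "240": (3, None),
    "310": (4, 6),  # (310n): the fourth digit n is part of the AI
    "37": (2, None),
}


def extract_ais(data):
    """Split decoded GS1 scan data into a list of (AI, value) pairs."""
    if data.startswith("]C1"):  # optional symbology identifier for GS1-128
        data = data[3:]
    fields = []
    pos = 0
    while pos < len(data):
        for prefix, (ai_len, value_len) in AI_TABLE.items():
            if data.startswith(prefix, pos):
                break
        else:
            raise ValueError(f"no AI matches the data at offset {pos}")
        ai = data[pos:pos + ai_len]
        pos += ai_len
        if value_len is not None:
            # Fixed-length value: take exactly value_len characters
            value = data[pos:pos + value_len]
            if len(value) != value_len:
                raise ValueError(f"value of AI ({ai}) is truncated")
            pos += value_len
        else:
            # Variable-length value: runs until GS or the end of the data
            end = data.find(GS, pos)
            if end == -1:
                end = len(data)
            value = data[pos:end]
            pos = end
        if pos < len(data) and data[pos] == GS:
            pos += 1  # skip the separator itself
        fields.append((ai, value))
    return fields


print(extract_ais("019501234567890310000123\x1d17150801"))
# [('01', '95012345678903'), ('10', '000123'), ('17', '150801')]
```

The sketch performs no validation beyond field lengths; checking character sets or date validity would need the extra columns of the specification.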
Update August 2022: GS1 has recently released the GS1 Syntax Engine, a C library that is a reference implementation for processing GS1 Application Identifier syntax scan data: https://github.com/gs1/gs1-syntax-engine
Lookup
To obtain the remaining data that you are interested in (which is not directly encoded in the barcode) such as the item's name and manufacturer details requires that you look up the extracted GTIN using external resources such as a local product database or one of the public UPC database APIs that are available.
The GTIN itself begins with the GS1 Prefix, which is variable-length and assigned by GS1. It identifies the national GS1 Member Organisation with which the manufacturer is registered (so not quite the country of origin), followed by a manufacturer identifier. The remaining digits represent the product code, which is assigned freely by the manufacturer.
Given a GTIN, some UPC databases will provide only details relating to the GS1 Prefix such as a textual representation of the GS1 Member Organisation and the manufacturer. Others attempt to maintain a record of individual GTIN assignments to common items, however this data will always be somewhat incomplete and out of date as there is no mandatory registry of real time GTIN assignments.
The answers to this question provide some examples of free product information platforms.
[†] In fact you might see ]C1019501234567890310000123{GS}17150801 in which case the leading symbology identifier for GS1-128 ]C1 can be discarded.
This is a solution written in JavaScript that has been proven at one specific customer; generalizing it requires more work:
//define AI's, parameter name and, optionally, transformation functions
SapApplicationIdentifiers= [
{ ai: '00', regex: /^00(\d{18})/, parameter: 'SSCC'},
{ ai: '01', regex: /^01(\d{14})/, parameter: 'EAN'},
{ ai: '02', regex: /^02(\d{14})/, parameter: 'EAN'},
{ ai: '10', regex: /^10([^\u001D]{1,20})/, parameter: 'LOTE'},
{ ai: '13', regex: /^13(\d{6})/},
{ ai: '15', regex: /^15(\d{6})/, parameter: 'F_CONS', transform: function(match){ return '20'+match[1].substr(0,2)+'-'+match[1].substr(2,2)+'-'+match[1].substr(4,2);}},
{ ai: '17', regex: /^17(\d{6})/, parameter: 'F_CONS', transform: function(match){ return '20'+match[1].substr(0,2)+'-'+match[1].substr(2,2)+'-'+match[1].substr(4,2);}},
{ ai: '19', regex: /^19(\d{6})/, parameter: 'F_CONS', transform: function(match){ return '20'+match[1].substr(0,2)+'-'+match[1].substr(2,2)+'-'+match[1].substr(4,2);}},
{ ai: '21', regex: /^21([\d\w]{1,20})/}, //numero de serie
{ ai: '30', regex: /^30(\d{1,8})/},
{ ai: '310', regex: /^310(\d)(\d{6})/, parameter: 'NTGEW', transform: function(match){ return parseInt( match[2] ) / Math.pow( 10,parseInt( match[1] ) )}},
{ ai: '320', regex: /^320(\d)(\d{6})/, parameter: 'NTGEW', transform: function(match){ return parseInt( match[2] ) / Math.pow( 10,parseInt( match[1] ) )}},
{ ai: '330', regex: /^330(\d)(\d{6})/},
{ ai: '37', regex: /^37(\d{1,8})/, parameter: 'CANT'}
];
//walks through the code, removing recognized fields
function parseAiByAi(code, mercancia, onError ){
    var match;
    if(!code)
        return;
    SapApplicationIdentifiers.forEach(function(AI){
        if(code.indexOf(AI.ai)==0 && AI.regex.test(code)){
            match= AI.regex.exec( code );
            if(AI.parameter){
                if(angular.isFunction(AI.transform)){
                    mercancia[AI.parameter] = AI.transform(match);
                }else
                    mercancia[AI.parameter]= match[1];
                if(AI.parameter=="NTGEW"){
                    mercancia.NTGEW_IA= AI.ai;
                }
            }
            code= code.replace(match[0],'').replace(/^[\0\u001D]/,'');
            parseAiByAi(code, mercancia, onError);
        }
    });
}
parseAiByAi(code, mercancia, onError);
You could try using the UPC Database API. They have no guarantee of uptime however and they limit to 1000 queries per day. I was also able to find this API which charges $1/1000 calls. Good luck!

Sort a table with UTF-8 encoded values alphabetically

I am storing dictionary entries in a Lua table, using it as an array. I want to sort the entries from Lua, so that I can add new ones without having to move to the correct position myself (which gets quite tedious soon). However, I am facing several problems:
Many words contain non-ASCII characters, which makes the built-in comparison operator for strings unsuitable for the task (for instance, it makes amputar come before ámbito).
There are words from various languages (all Western, though), namely Spanish, German and English. The problem here is that, probably, different languages have different notions of the alphabetical order. Since the main language is Spanish, I would like to use its rules, although I'm unsure as to whether that will work with characters not contained in the Spanish alphabet.
Some words contain capital letters, or, even worse, start with them. For example, all German nouns start with upper-case letters. By the built-in comparison operator, the capital letters come before their lower-case siblings, which is not my desired behaviour; I would like upper-case letters to be treated exactly as their lower-case counterparts.
Take, for example, the following table:
local entries =
{
'amputar',
'Volksgeist',
'ámbito'
}
Those entries should be ordered like this:
ámbito
amputar
Volksgeist
However, with my current code, the output is wrong:
local function compare_utf8_strings( o1 , o2 )
-- Using the built-in non-UTF-8-aware non-locale-aware string comparison operator
return o1 < o2
end
table.sort( entries , function ( a , b ) return compare_utf8_strings( a , b ) end )
for i, entry in ipairs(entries) do
print( entry )
end
That outputs:
Volksgeist
amputar
ámbito
Could you please take the following code, and hack it to fulfill my requirements?
local entries =
{
'amputar',
'Volksgeist',
'ámbito'
}
local function compare_utf8_strings( o1 , o2 )
-- Hack here, please, accomplishing my requirements
end
table.sort( entries , function ( a , b ) return compare_utf8_strings( a , b ) end )
for i, entry in ipairs(entries) do
print( entry )
end
It should output this:
ámbito
amputar
Volksgeist
As an additional requirement, this Lua code all runs inside LuaTeX, which currently supports version 5.2 of the language. As for external libraries, I guess it's possible to use them.
I am a novice in the Lua camp, so please forgive any errors I have made, and feel free to point them out so that I can fix them.
After some time searching to no avail, I found this article by Joseph Wright. Although it touched my issue, it didn't provide a clear solution to follow. I asked him, and it turned out that there's currently no direct way to do what I want. He pointed out, however, that slnunicode comes built-in with LuaTeX (albeit it will be replaced in the future).
I developed a 'crude' solution using the facilities provided in the LuaTeX environment. It isn't elegant, but it works, and it doesn't pull any external dependencies. About its efficiency, I have not perceived any difference in the document build time.
-- Make the facilities available
unicode = require( 'unicode' )
utf8 = unicode.utf8
--[[
Each character's position in this array-like table determines its 'priority'.
Several characters in the same slot have the same 'priority'.
]]
local alphabet =
{
-- The space is here because of other requirements of my project
{ ' ' },
{ 'a', 'á', 'à', 'ä' },
{ 'b' },
{ 'c' },
{ 'd' },
{ 'e', 'é', 'è', 'ë' },
{ 'f' },
{ 'g' },
{ 'h' },
{ 'i', 'í', 'ì', 'ï' },
{ 'j' },
{ 'k' },
{ 'l' },
{ 'm' },
{ 'n' },
{ 'ñ' },
{ 'o', 'ó', 'ò', 'ö' },
{ 'p' },
{ 'q' },
{ 'r' },
{ 's' },
{ 't' },
{ 'u', 'ú', 'ù', 'ü' },
{ 'v' },
{ 'w' },
{ 'x' },
{ 'y' },
{ 'z' }
}
-- Looks up the character `character´ in the alphabet and returns its 'priority'
local function get_pos_in_alphabet( character )
    for i, alphabet_entry in ipairs(alphabet) do
        for _, alphabet_char in ipairs(alphabet_entry) do
            if character == alphabet_char then
                return i
            end
        end
    end
    --[[
    If it isn't in the alphabet, abort: it's better than silently outputting some
    random garbage, and, thanks to the message, allows to add the character to
    the table.
    ]]
    assert( false , "'" .. character .. "' was not in alphabet" )
end
-- Returns the characters in the UTF-8-encoded string `s´ in an array-like table
local function get_utf8_string_characters( s )
    --[[
    I saw this variable being used in several code snippets around the Web, but
    it isn't provided in my LuaTeX environment; I use this form of initialization
    to be safe if it's defined in the future.
    ]]
    utf8.charpattern = utf8.charpattern or "([%z\1-\127\194-\244][\128-\191]*)"
    local characters = {}
    for character in s:gmatch(utf8.charpattern) do
        table.insert( characters , character )
    end
    return characters
end
local function compare_utf8_strings( _o1 , _o2 )
    --[[
    `o1_chars´ and `o2_chars´ are array-like tables containing all of the
    characters of each string, which are all made lower-case using the
    slnunicode facilities that come built-in with LuaTeX.
    ]]
    local o1_chars = get_utf8_string_characters( utf8.lower(_o1) )
    local o2_chars = get_utf8_string_characters( utf8.lower(_o2) )
    -- Number of characters (not bytes) in each string
    local o1_len = #o1_chars
    local o2_len = #o2_chars
    for i = 1, math.min( o1_len , o2_len ) do
        local o1_pos = get_pos_in_alphabet( o1_chars[i] )
        local o2_pos = get_pos_in_alphabet( o2_chars[i] )
        if o1_pos > o2_pos then
            return false
        elseif o1_pos < o2_pos then
            return true
        end
    end
    -- All shared positions are equal: the shorter string sorts first
    return o1_len < o2_len
end
I cannot integrate this solution in the question's framework because my test environment, the ZeroBrane Studio Lua IDE, doesn't come with slnunicode and I don't know how to add it.
That was it. If anyone has any doubt or would like further explanations, please, use the comments. I hope it's useful to someone else.
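The priority-table idea is not tied to Lua or slnunicode. For comparison, here is a minimal sketch of the same approach in Python; the alphabet mirrors the table above, `sort_key` is a name made up for this example, and characters missing from the table (ß, for instance) raise a `KeyError`, much like the `assert` in the Lua version.

```python
import unicodedata

# Each string is one slot; letters sharing a slot have equal priority
ALPHABET = [
    " ", "aáàä", "b", "c", "d", "eéèë", "f", "g", "h", "iíìï", "j", "k",
    "l", "m", "n", "ñ", "oóòö", "p", "q", "r", "s", "t", "uúùü", "v",
    "w", "x", "y", "z",
]
PRIORITY = {ch: rank for rank, slot in enumerate(ALPHABET) for ch in slot}


def sort_key(word):
    """Map a word to its list of slot numbers, ignoring case."""
    # NFC makes sure accented letters are single code points, as in the table
    normalized = unicodedata.normalize("NFC", word.lower())
    return [PRIORITY[ch] for ch in normalized]


entries = ["amputar", "Volksgeist", "ámbito"]
print(sorted(entries, key=sort_key))
# ['ámbito', 'amputar', 'Volksgeist']
```

Lists compare element by element and a shorter list that is a prefix of a longer one sorts first, which matches the final length comparison in the Lua function.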

Formatting y-axis

How can I format the monetary values on the y-axis of my bar chart like this:
R$ 123.456,00
Instead of:
R$ 123,456.00
Currently I'm using this function to format, but can't make this simple change:
var format = d3.format(',.2f'); // Need to change this, but don't know how
chart.yAxis.tickFormat(function(d) {
return "R$ " + format(d);
});
I've already searched in D3 documentation, but can't find anything.
With d3 5.5 you can create a custom locale:
https://github.com/d3/d3-format#formatLocale
Also note that the specifier (the argument passed to .format) now includes a $, which automatically adds the currency prefix to the formatted string.
const customD3Locale = d3.formatLocale({
decimal: ",",
thousands: ".",
grouping: [3],
currency: ["R$ ", ""] // the trailing space yields "R$ 123.456,00" rather than "R$123.456,00"
})
const format = customD3Locale.format('$,.2f');
The format method doesn't seem to allow custom thousands and decimal separators. I think that you should replace the symbols yourself:
var format = d3.format(',.2f');
// Format the number, adding thousands and decimal separators
var label = format(1234.00);
// Swap the . and , symbols. The ! is a temporary placeholder needed for the swap;
// any other symbol that cannot appear in the label would do.
// The g flag replaces every occurrence, so larger numbers with several
// thousands separators are handled too.
label = label.replace(/\./g, '!');
label = label.replace(/,/g, '.');
label = label.replace(/!/g, ',');
// The result is 'R$ 1.234,00'
d3.select('#chart').append('p').text('R$ ' + label);
This jsfiddle has the replacement code.
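Outside of d3, the same separator swap can be sketched in a few lines of Python. The function name `format_brl` is made up for this illustration; `str.translate` swaps both characters in one pass, so no temporary placeholder is needed.

```python
def format_brl(value):
    """Format a number as 'R$ 123.456,00' (dot for thousands, comma for decimals)."""
    us_style = f"{value:,.2f}"                  # e.g. '123,456.00'
    swap = str.maketrans({",": ".", ".": ","})  # swap both separators at once
    return "R$ " + us_style.translate(swap)


print(format_brl(123456))     # R$ 123.456,00
print(format_brl(1234567.5))  # R$ 1.234.567,50
```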

International Phone Number Regex for Scraping [duplicate]

I'm trying to put together a comprehensive regex to validate phone numbers. Ideally it would handle international formats, but it must handle US formats, including the following:
1-234-567-8901
1-234-567-8901 x1234
1-234-567-8901 ext1234
1 (234) 567-8901
1.234.567.8901
1/234/567/8901
12345678901
I'll answer with my current attempt, but I'm hoping somebody has something better and/or more elegant.
Better option... just strip all non-digit characters on input (except 'x' and leading '+' signs), taking care because of the British tendency to write numbers in the non-standard form +44 (0) ... when asked to use the international prefix (in that specific case, you should discard the (0) entirely).
Then, you end up with values like:
12345678901
12345678901x1234
345678901x1234
12344678901
12345678901
12345678901
12345678901
+4112345678
+441234567890
Then, when you display the number, reformat it to your heart's content, e.g.
1 (234) 567-8901
1 (234) 567-8901 x1234
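As a rough illustration of the stripping step, here is a minimal Python sketch. The function name `normalize_phone` is made up for this example, and the handling of "(0)" only covers the British-style case described above.

```python
import re


def normalize_phone(raw):
    """Keep digits, a leading '+', and an 'x' marking an extension."""
    text = raw.strip().lower()
    # Drop the trunk "(0)" that often follows an international prefix,
    # e.g. "+44 (0) 1234 567890"
    if text.startswith("+"):
        text = re.sub(r"\(\s*0\s*\)", "", text, count=1)
    plus = "+" if text.startswith("+") else ""
    # "ext1234" collapses to "x1234" because only the x survives
    return plus + re.sub(r"[^0-9x]", "", text)


for raw in ["1-234-567-8901 ext1234", "1 (234) 567-8901", "+44 (0) 1234 567890"]:
    print(normalize_phone(raw))
# 12345678901x1234
# 12345678901
# +441234567890
```

Storing the normalized form keeps comparison and lookup simple, and the display format can be chosen separately.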
.*
If the users want to give you their phone numbers, then trust them to get it right. If they do not want to give it to you then forcing them to enter a valid number will either send them to a competitor's site or make them enter a random string that fits your regex. I might even be tempted to look up the number of a premium rate horoscope hotline and enter that instead.
I would also consider any of the following as valid entries on a web site:
"123 456 7890 until 6pm, then 098 765 4321"
"123 456 7890 or try my mobile on 098 765 4321"
"ex-directory - mind your own business"
It turns out that there's something of a spec for this, at least for North America, called the NANP.
You need to specify exactly what you want. What are legal delimiters? Spaces, dashes, and periods? No delimiter allowed? Can one mix delimiters (e.g., +0.111-222.3333)? How are extensions (e.g., 111-222-3333 x 44444) going to be handled? What about special numbers, like 911? Is the area code going to be optional or required?
Here's a regex for a 7 or 10 digit number, with extensions allowed, delimiters are spaces, dashes, or periods:
^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$
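To see what the capture groups hold, the pattern can be tried out in Python, whose `re` module accepts this syntax unchanged. The helper `parse_nanp` and its dictionary keys are names invented for this sketch. Groups 1 and 2 both capture the area code (with and without parentheses), so only one of them is ever set.

```python
import re

# The same pattern as above, split over several lines for readability
NANP = re.compile(
    r"^(?:(?:\+?1\s*(?:[.-]\s*)?)?"
    r"(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)"
    r"|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?"
    r"([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?"
    r"([0-9]{4})"
    r"(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$"
)


def parse_nanp(text):
    """Return the parts of a NANP number, or None if the text does not match."""
    m = NANP.match(text)
    if m is None:
        return None
    return {
        "area": m.group(1) or m.group(2),
        "exchange": m.group(3),
        "line": m.group(4),
        "extension": m.group(5),
    }


print(parse_nanp("1-234-567-8901 x1234"))
# {'area': '234', 'exchange': '567', 'line': '8901', 'extension': '1234'}
print(parse_nanp("1/234/567/8901"))  # None: slashes are not accepted delimiters
```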
I would also suggest looking at the "libphonenumber" Google Library. I know it is not regex but it does exactly what you want.
For example, it will recognize that:
15555555555
is a possible number but not a valid number. It also supports countries outside the US.
Highlights of functionality:
Parsing/formatting/validating phone numbers for all countries/regions of the world.
getNumberType - gets the type of the number based on the number itself; able to distinguish Fixed-line, Mobile, Toll-free, Premium Rate, Shared Cost, VoIP and Personal Numbers (whenever feasible).
isNumberMatch - gets a confidence level on whether two numbers could be the same.
getExampleNumber/getExampleNumberByType - provides valid example numbers for all countries/regions, with the option of specifying which type of example phone number is needed.
isPossibleNumber - quickly guessing whether a number is a possible phonenumber by using only the length information, much faster than a full validation.
isValidNumber - full validation of a phone number for a region using length and prefix information.
AsYouTypeFormatter - formats phone numbers on-the-fly when users enter each digit.
findNumbers - finds numbers in text input.
PhoneNumberOfflineGeocoder - provides geographical information related to a phone number.
Examples
The biggest problem with phone number validation is that it is very culturally dependent.
America
(408) 974–2042 is a valid US number
(999) 974–2042 is not a valid US number
Australia
0404 999 999 is a valid Australian number
(02) 9999 9999 is also a valid Australian number
(09) 9999 9999 is not a valid Australian number
A regular expression is fine for checking the format of a phone number, but it's not really going to be able to check the validity of a phone number.
I would suggest skipping a simple regular expression to test your phone number against, and using a library such as Google's libphonenumber (link to GitHub project).
Introducing libphonenumber!
Using one of your more complex examples, 1-234-567-8901 x1234, you get the following data out of libphonenumber (link to online demo):
Validation Results
Result from isPossibleNumber() true
Result from isValidNumber() true
Formatting Results:
E164 format +12345678901
Original format (234) 567-8901 ext. 123
National format (234) 567-8901 ext. 123
International format +1 234-567-8901 ext. 123
Out-of-country format from US 1 (234) 567-8901 ext. 123
Out-of-country format from CH 00 1 234-567-8901 ext. 123
So not only do you learn if the phone number is valid (which it is), but you also get consistent phone number formatting in your locale.
As a bonus, libphonenumber has a number of datasets to check the validity of phone numbers, as well, so checking a number such as +61299999999 (the international version of (02) 9999 9999) returns as a valid number with formatting:
Validation Results
Result from isPossibleNumber() true
Result from isValidNumber() true
Formatting Results
E164 format +61299999999
Original format 61 2 9999 9999
National format (02) 9999 9999
International format +61 2 9999 9999
Out-of-country format from US 011 61 2 9999 9999
Out-of-country format from CH 00 61 2 9999 9999
libphonenumber also gives you many additional benefits, such as grabbing the location that the phone number is detected as being, and also getting the time zone information from the phone number:
PhoneNumberOfflineGeocoder Results
Location Australia
PhoneNumberToTimeZonesMapper Results
Time zone(s) [Australia/Sydney]
But the invalid Australian phone number ((09) 9999 9999) returns that it is not a valid phone number.
Validation Results
Result from isPossibleNumber() true
Result from isValidNumber() false
Google's version has code for Java and Javascript, but people have also implemented libraries for other languages that use the Google i18n phone number dataset:
PHP: https://github.com/giggsey/libphonenumber-for-php
Python: https://github.com/daviddrysdale/python-phonenumbers
Ruby: https://github.com/sstephenson/global_phone
C#: https://github.com/twcclegg/libphonenumber-csharp
Objective-C: https://github.com/iziz/libPhoneNumber-iOS
JavaScript: https://github.com/ruimarinho/google-libphonenumber
Elixir: https://github.com/socialpaymentsbv/ex_phone_number
Unless you are certain that you are always going to be accepting numbers from one locale, and they are always going to be in one format, I would heavily suggest not writing your own code for this, and using libphonenumber for validating and displaying phone numbers.
/^(?:(?:\(?(?:00|\+)([1-4]\d\d|[1-9]\d+)\)?)[\-\.\ \\\/]?)?((?:\(?\d{1,}\)?[\-\.\ \\\/]?)+)(?:[\-\.\ \\\/]?(?:#|ext\.?|extension|x)[\-\.\ \\\/]?(\d+))?$/i
This matches:
- (+351) 282 43 50 50
- 90191919908
- 555-8909
- 001 6867684
- 001 6867684x1
- 1 (234) 567-8901
- 1-234-567-8901 x1234
- 1-234-567-8901 ext1234
- 1-234 567.89/01 ext.1234
- 1(234)5678901x1234
- (123)8575973
- (0055)(123)8575973
The capture groups save:
$1: Country indicator
$2: Phone number
$3: Extension
You can test it on https://regex101.com/r/kFzb1s/1
Although the answer to strip all whitespace is neat, it doesn't really solve the problem that's posed, which is to find a regex. Take, for instance, my test script that downloads a web page and extracts all phone numbers using the regex. Since you'd need a regex anyway, you might as well have the regex do all the work. I came up with this:
1?\W*([2-9][0-8][0-9])\W*([2-9][0-9]{2})\W*([0-9]{4})(\se?x?t?(\d*))?
Here's a Perl script to test it. When you match, $1 contains the area code, $2 and $3 contain the phone number, and $5 contains the extension. My test script downloads a file from the internet and prints all the phone numbers in it.
#!/usr/bin/perl
my $us_phone_regex =
    '1?\W*([2-9][0-8][0-9])\W*([2-9][0-9]{2})\W*([0-9]{4})(\se?x?t?(\d*))?';
my @tests =
(
    "1-234-567-8901",
    "1-234-567-8901 x1234",
    "1-234-567-8901 ext1234",
    "1 (234) 567-8901",
    "1.234.567.8901",
    "1/234/567/8901",
    "12345678901",
    "not a phone number"
);
foreach my $num (@tests)
{
    if( $num =~ m/$us_phone_regex/ )
    {
        print "match [$1-$2-$3]\n" if not defined $4;
        print "match [$1-$2-$3 $5]\n" if defined $4;
    }
    else
    {
        print "no match [$num]\n";
    }
}
#
# Extract all phone numbers from an arbitrary file.
#
my $external_filename =
    'http://web.textfiles.com/ezines/PHREAKSANDGEEKS/PnG-spring05.txt';
my @external_file = `curl $external_filename`;
foreach my $line (@external_file)
{
    if( $line =~ m/$us_phone_regex/ )
    {
        print "match $1 $2 $3\n";
    }
}
Edit:
You can change \W* to \s*\W?\s* in the regex to tighten it up a bit. I wasn't thinking of the regex in terms of, say, validating user input on a form when I wrote it, but this change makes it possible to use the regex for that purpose.
'1?\s*\W?\s*([2-9][0-8][0-9])\s*\W?\s*([2-9][0-9]{2})\s*\W?\s*([0-9]{4})(\se?x?t?(\d*))?';
I answered this in another SO question before deciding to also post my answer on this thread, because no one was addressing how to require or not require items; they were just handing out regexes:
Regex working wrong, matching unexpected things
From my post on that site, I've created a quick guide to assist anyone with making their own regex for their own desired phone number format, which I will caveat (like I did on the other site) that if you are too restrictive, you may not get the desired results, and there is no "one size fits all" solution to accepting all possible phone numbers in the world - only what you decide to accept as your format of choice. Use at your own risk.
Quick cheat sheet
Start the expression: /^
If you want to require a space, use: [\s] or \s
If you want to require parentheses, use: [(] and [)] . Using \( and \) is ugly and can make things confusing.
If you want anything to be optional, put a ? after it
If you want a hyphen, just type - or [-] . If you do not put it first or last in a series of other characters, though, you may need to escape it: \-
If you want to accept different choices in a slot, put brackets around the options: [-.\s] will require a hyphen, period, or space. A question mark after the last bracket will make all of those optional for that slot.
\d{3} : Requires a 3-digit number: 000-999. Shorthand for [0-9][0-9][0-9].
[2-9] : Requires a digit 2-9 for that slot.
(\+|1\s)? : Accept a "plus" or a 1 and a space (pipe character, |, is "or"), and make it optional. The "plus" sign must be escaped.
If you want specific numbers to match a slot, enter them: [246] will require a 2, 4, or 6. (?:77|78) will require 77 or 78. Do not write [77|78] for that: inside brackets the pipe is a literal character, so it matches a single 7, 8, or |.
$/ : End the expression
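Putting several of the cheat-sheet pieces together gives a small pattern for one possible US-style format. This is only a sketch of how the pieces combine, written with Python's `re` module; the choice of which parts are optional is an assumption made for the example, not a recommendation.

```python
import re

PHONE = re.compile(
    r"^"              # start the expression
    r"(\+|1\s)?"      # optional "+" or "1 " prefix
    r"[(]?\d{3}[)]?"  # 3-digit area code, each parenthesis optional
    r"[-.\s]?"        # optional hyphen, period, or space
    r"\d{3}"          # 3-digit exchange
    r"[-.\s]?"        # optional hyphen, period, or space
    r"\d{4}"          # 4-digit line number
    r"$"              # end the expression
)

for candidate in ["(234) 567-8901", "1 234-567-8901", "234.567.8901", "234-567-890"]:
    print(candidate, bool(PHONE.match(candidate)))
# (234) 567-8901 True
# 1 234-567-8901 True
# 234.567.8901 True
# 234-567-890 False
```

Because each parenthesis is optional on its own, an unbalanced input such as "(234 567-8901" is also accepted; requiring both or neither needs an alternation instead.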
I wrote the simplest one (I didn't need to allow a dot in it).
^([0-9\(\)\/\+ \-]*)$
As mentioned below, it only checks which characters are used, not their structure or order.
Note that stripping () characters does not work for a style of writing UK numbers that is common: +44 (0) 1234 567890 which means dial either the international number:
+441234567890
or in the UK dial 01234567890
If you just want to verify you don't have random garbage in the field (i.e., from form spammers) this regex should do nicely:
^[0-9+\(\)#\.\s\/ext-]+$
Note that it doesn't have any special rules for how many digits there are, or which numbers are valid in those digits; it just verifies that only digits, parentheses, dashes, plus signs, whitespace, pound signs, periods, slashes, or the letters e, x, t are present.
It should be compatible with international numbers and localization formats. Do you foresee any need to allow square, curly, or angled brackets for some regions? (currently they aren't included).
If you want to maintain per digit rules (such as in US Area Codes and Prefixes (exchange codes) must fall in the range of 200-999) well, good luck to you. Maintaining a complex rule-set which could be outdated at any point in the future by any country in the world does not sound fun.
And while stripping all/most non-numeric characters may work well on the server side (especially if you are planning on passing these values to a dialer), you may not want to thrash the user's input during validation, particularly if you want them to make corrections in another field.
Here's a wonderful pattern that most closely matched the validation that I needed to achieve. I'm not the original author, but I think it's well worth sharing as I found this problem to be very complex and without a concise or widely useful answer.
The following regex will catch widely used number and character combinations in a variety of global phone number formats:
/^\s*(?:\+?(\d{1,3}))?([-. (]*(\d{3})[-. )]*)?((\d{3})[-. ]*(\d{2,4})(?:[-.x ]*(\d+))?)\s*$/gm
Positive:
+42 555.123.4567
+1-(800)-123-4567
+7 555 1234567
+7(926)1234567
(926) 1234567
+79261234567
926 1234567
9261234567
1234567
123-4567
123-89-01
495 1234567
469 123 45 67
89261234567
8 (926) 1234567
926.123.4567
415-555-1234
650-555-2345
(416)555-3456
202 555 4567
4035555678
1 416 555 9292
Negative:
926 3 4
8 800 600-APPLE
Original source: http://www.regexr.com/38pvb
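For anyone who wants to try the pattern outside of regexr, here is a minimal Python sketch that applies it line by line. The helper name `find_phone_lines` is made up for this example; matching each line separately stands in for the multiline flag of the original.

```python
import re

# The same pattern as above, split in two for readability
GLOBAL_PHONE = re.compile(
    r"^\s*(?:\+?(\d{1,3}))?([-. (]*(\d{3})[-. )]*)?"
    r"((\d{3})[-. ]*(\d{2,4})(?:[-.x ]*(\d+))?)\s*$"
)


def find_phone_lines(text):
    """Return the lines of `text` that match the pattern as a whole."""
    return [line for line in text.splitlines() if GLOBAL_PHONE.match(line)]


sample = "+42 555.123.4567\n926 3 4\n(926) 1234567\n8 800 600-APPLE\n123-89-01"
print(find_phone_lines(sample))
# ['+42 555.123.4567', '(926) 1234567', '123-89-01']
```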
Have you had a look over at RegExLib?
Entering US phone number brought back quite a list of possibilities.
My attempt at an unrestrictive regex:
/^[+#*\(\)\[\]]*([0-9][ ext+-pw#*\(\)\[\]]*){6,45}$/
Accepts:
+(01) 123 (456) 789 ext555
123456
*44 123-456-789 [321]
123456
123456789012345678901234567890123456789012345
*****++[](][((( 123456tteexxttppww
Rejects:
mob 07777 777777
1234 567 890 after 5pm
john smith
(empty)
1234567890123456789012345678901234567890123456
911
This only validates that the input could be a phone number; it is still up to you to sanitize it for display.
I found this to work quite well:
^\(*\+*[1-9]{0,3}\)*-*[1-9]{0,3}[-. /]*\(*[2-9]\d{2}\)*[-. /]*\d{3}[-. /]*\d{4} *e*x*t*\.* *\d{0,4}$
It works for these number formats:
1-234-567-8901
1-234-567-8901 x1234
1-234-567-8901 ext1234
1 (234) 567-8901
1.234.567.8901
1/234/567/8901
12345678901
1-234-567-8901 ext. 1234
(+351) 282 433 5050
Make sure to use the global and multiline flags.
Link: http://www.regexr.com/3bp4b
Here's my best try so far. It handles the formats above but I'm sure I'm missing some other possible formats.
^\d?(?:(?:[\+]?(?:[\d]{1,3}(?:[ ]+|[\-.])))?[(]?(?:[\d]{3})[\-/)]?(?:[ ]+)?)?(?:[a-zA-Z2-9][a-zA-Z0-9 \-.]{6,})(?:(?:[ ]+|[xX]|(?i:ext[\.]?)){1,2}(?:[\d]{1,5}))?$
This is a simple Regular Expression pattern for Philippine Mobile Phone Numbers:
((\+[0-9]{2})|0)[.\- ]?9[0-9]{2}[.\- ]?[0-9]{3}[.\- ]?[0-9]{4}
or
((\+63)|0)[.\- ]?9[0-9]{2}[.\- ]?[0-9]{3}[.\- ]?[0-9]{4}
will match these:
+63.917.123.4567
+63-917-123-4567
+63 917 123 4567
+639171234567
09171234567
The first one will match ANY two digit country code, while the second one will match the Philippine country code exclusively.
Test it here: http://refiddle.com/1ox
If you're talking about form validation, the regexp to validate correct meaning as well as correct data is going to be extremely complex because of varying country and provider standards. It will also be hard to keep up to date.
I interpret the question as looking for a broadly valid pattern, which may not be internally consistent: for example, accepting a valid set of digits without checking that the trunk line, exchange, etc. conform to the valid pattern for the country code prefix.
North America is straightforward, and for international I prefer to use an 'idiomatic' pattern which covers the ways in which people specify and remember their numbers:
^((((\(\d{3}\))|(\d{3}-))\d{3}-\d{4})|(\+?\d{2}((-| )\d{1,8}){1,5}))(( x| ext)\d{1,5}){0,1}$
The North American pattern makes sure that if one parenthesis is included both are. The international accounts for an optional initial '+' and country code. After that, you're in the idiom. Valid matches would be:
(xxx)xxx-xxxx
(xxx)-xxx-xxxx
(xxx)xxx-xxxx x123
12 1234 123 1 x1111
12 12 12 12 12
12 1 1234 123456 x12345
+12 1234 1234
+12 12 12 1234
+12 1234 5678
+12 12345678
This may be biased as my experience is limited to North America, Europe and a small bit of Asia.
My gut feeling is reinforced by the number of replies to this topic: there is a virtually infinite number of solutions to this problem, none of which are going to be elegant.
Honestly, I would recommend you don't try to validate phone numbers. Even if you could write a big, hairy validator that would allow all the different legitimate formats, it would end up allowing pretty much anything even remotely resembling a phone number in the first place.
In my opinion, the most elegant solution is to validate a minimum length, nothing more.
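A minimal sketch of such a minimum-length check, counting digits only so that spaces and punctuation do not matter. The helper name hasEnoughDigits and the threshold of 7 are assumptions for illustration, not values taken from any standard.

```javascript
// Assumed threshold for illustration only; pick one that fits your data.
const MIN_DIGITS = 7;

// Hypothetical helper: counts digits and ignores every other character.
function hasEnoughDigits(input, minDigits = MIN_DIGITS) {
  const digitCount = input.replace(/\D/g, '').length;
  return digitCount >= minDigits;
}

console.log(hasEnoughDigits('+7 (926) 123-45-67')); // true
console.log(hasEnoughDigits('926 3 4'));            // false
```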
You'll have a hard time dealing with international numbers with a single, simple regex; see this post on the difficulties of international (and even North American) phone numbers.
You'll want to parse the first few digits to determine what the country code is, then act differently based on the country.
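A rough sketch of that idea, assuming the number is written in international form with a leading +. The table holds only a handful of calling codes for illustration, and splitCallingCode is a hypothetical helper name; a real implementation would need the full list of codes and per-country rules.

```javascript
// Tiny illustrative table; a real one would cover every assigned calling code.
const CALLING_CODES = new Map([
  ['1', 'NANP (US, Canada, ...)'],
  ['7', 'Russia / Kazakhstan'],
  ['44', 'United Kingdom'],
  ['63', 'Philippines'],
  ['353', 'Ireland'],
]);

// Hypothetical helper: expects international form ("+..."), returns
// { code, region, rest } or null if the prefix is not in the table.
function splitCallingCode(input) {
  if (!input.trim().startsWith('+')) return null;
  const digits = input.replace(/\D/g, '');
  // Calling codes are 1 to 3 digits long and no code is a prefix of
  // another, so at most one of these lengths can match.
  for (let len = 1; len <= 3; len++) {
    const code = digits.slice(0, len);
    if (CALLING_CODES.has(code)) {
      return { code, region: CALLING_CODES.get(code), rest: digits.slice(len) };
    }
  }
  return null;
}

console.log(splitCallingCode('+63 917 123 4567'));
// { code: '63', region: 'Philippines', rest: '9171234567' }
console.log(splitCallingCode('09171234567')); // null (no leading +)
```

Once the code is known, the remaining digits can be handed to a country-specific check.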
Beyond that - the list you gave does not include another common US format - leaving off the initial 1. Most cell phones in the US don't require it, and it'll start to baffle the younger generation unless they've dialed internationally.
You've correctly identified that it's a tricky problem...
-Adam
After reading through these answers, it looks like there wasn't a straightforward regular expression that can parse through a bunch of text and pull out phone numbers in any format (including international with and without the plus sign).
Here's what I used for a client project recently, where we had to convert all phone numbers in any format to tel: links.
So far, it's been working with everything they've thrown at it, but if errors come up, I'll update this answer.
Regex:
/(\+*\d{1,})*([ |\(])*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})/
PHP function to replace all phone numbers with tel: links (in case anyone is curious):
function phoneToTel($number) {
$return = preg_replace('/(\+*\d{1,})*([ |\(])*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})/', '$1 ($3) $4-$5', $number); // includes international
return $return;
}
I believe the Number::Phone::US and Regexp::Common (particularly the source of Regexp::Common::URI::RFC2806) Perl modules could help.
The question should probably be specified in a bit more detail to explain the purpose of validating the numbers. For instance, 911 is a valid number in the US, but 911x isn't for any value of x. That's so that the phone company can calculate when you are done dialing. There are several variations on this issue. But your regex doesn't check the area code portion, so that doesn't seem to be a concern.
Like validating email addresses, even if you have a valid result you can't know if it's assigned to someone until you try it.
If you are trying to validate user input, why not normalize the result and be done with it? If the user puts in a number you can't recognize as a valid number, either save it as entered or strip out undialable characters. The Number::Phone::Normalize Perl module could be a source of inspiration.
Do a replace on formatting characters, then check the remaining for phone validity. In PHP,
$replace = array( ' ', '-', '/', '(', ')', ',', '.' ); //etc; as needed
preg_match( '/1?[0-9]{10}((ext|x)[0-9]{1,4})?/i', str_replace( $replace, '', $phone_num ) );
Breaking a complex regexp up like this can be just as effective, and much simpler.
I work for a market research company and we have to filter these types of input alllll the time. You're complicating it too much. Just strip the non-alphanumeric chars, and see if there's an extension.
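A minimal sketch of that strip-then-check approach in JavaScript, assuming NANP-style numbers (optional leading 1, ten digits, optional extension of up to four digits). The helper name parseNanpNumber is hypothetical, and the pattern is anchored with ^ and $ here, which is a deliberate tightening rather than something the answers above specify.

```javascript
// Assumes NANP-style input: optional leading 1, 10 digits, optional extension.
const NANP_PATTERN = /^1?([0-9]{10})(?:(?:ext|x)([0-9]{1,4}))?$/i;

// Hypothetical helper: strips everything that is not a letter or digit,
// then returns { number, extension } or null if the rest does not fit.
function parseNanpNumber(input) {
  const stripped = input.replace(/[^a-z0-9]/gi, '');
  const match = NANP_PATTERN.exec(stripped);
  if (!match) return null;
  return { number: match[1], extension: match[2] || null };
}

console.log(parseNanpNumber('(555) 123-4567 ext. 89'));
// { number: '5551234567', extension: '89' }
console.log(parseNanpNumber('1-800-555-1234'));
// { number: '8005551234', extension: null }
console.log(parseNanpNumber('555-1234')); // null
```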
For further analysis you can subscribe to one of many providers that will give you access to a database of valid numbers as well as tell you if they're landlines or mobiles, disconnected, etc. It costs money.
I found this to be something interesting. I have not tested it but it looks as if it would work
<?php
/*
string validate_telephone_number (string $number, array $formats)
*/
function validate_telephone_number($number, $formats)
{
$format = trim(preg_replace('/[0-9]/', '#', $number)); // ereg_replace() was removed in PHP 7
return in_array($format, $formats);
}
/* Usage Examples */
// List of possible formats: You can add new formats or modify the existing ones
$formats = array('###-###-####', '####-###-###',
'(###) ###-###', '####-####-####',
'##-###-####-####', '####-####', '###-###-###',
'#####-###-###', '##########');
$number = '08008-555-555';
if(validate_telephone_number($number, $formats))
{
echo $number.' is a valid phone number.';
}
echo "<br />";
$number = '123-555-555';
if(validate_telephone_number($number, $formats))
{
echo $number.' is a valid phone number.';
}
echo "<br />";
$number = '1800-1234-5678';
if(validate_telephone_number($number, $formats))
{
echo $number.' is a valid phone number.';
}
echo "<br />";
$number = '(800) 555-123';
if(validate_telephone_number($number, $formats))
{
echo $number.' is a valid phone number.';
}
echo "<br />";
$number = '1234567890';
if(validate_telephone_number($number, $formats))
{
echo $number.' is a valid phone number.';
}
?>
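The same mask idea can be sketched in JavaScript: every digit is replaced with # and the resulting shape is looked up in a list of accepted formats. The helper name matchesKnownFormat is made up, and the format list is only a subset of the one above.

```javascript
// Subset of the formats listed in the PHP example above.
const ALLOWED_FORMATS = new Set([
  '###-###-####',
  '(###) ###-###',
  '####-####-####',
  '#####-###-###',
  '##########',
]);

// Hypothetical helper: turns digits into '#' and checks the resulting mask.
function matchesKnownFormat(number, formats = ALLOWED_FORMATS) {
  const mask = number.trim().replace(/[0-9]/g, '#');
  return formats.has(mask);
}

console.log(matchesKnownFormat('08008-555-555')); // true
console.log(matchesKnownFormat('(800) 555-123')); // true
console.log(matchesKnownFormat('555-1234'));      // false
```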
You would probably be better off using a masked input for this. That way users can ONLY enter numbers and you can format them however you see fit. I'm not sure if this is for a web application, but if it is, there is a very slick jQuery plugin that offers some options for doing this.
http://digitalbush.com/projects/masked-input-plugin/
They even go over how to mask phone number inputs in their tutorial.
Here's one that works well in JavaScript. It's in a string because that's what the Dojo widget was expecting.
It matches a 10 digit North America NANP number with optional extension. Spaces, dashes and periods are accepted delimiters.
"^(\\(?\\d\\d\\d\\)?)( |-|\\.)?\\d\\d\\d( |-|\\.)?\\d{4,4}(( |-|\\.)?[ext\\.]+ ?\\d+)?$"
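As a quick illustration of why the backslashes are doubled: when the pattern is written as a string literal, each \\ becomes a single backslash before the regex engine sees it. The sketch below passes the same string to the RegExp constructor; the variable names are made up.

```javascript
// Same pattern as above, kept as a string (hence the doubled backslashes).
const NANP_SOURCE = "^(\\(?\\d\\d\\d\\)?)( |-|\\.)?\\d\\d\\d( |-|\\.)?\\d{4,4}(( |-|\\.)?[ext\\.]+ ?\\d+)?$";
const nanpRegex = new RegExp(NANP_SOURCE);

console.log(nanpRegex.test('(555) 123-4567'));      // true
console.log(nanpRegex.test('555.123.4567 ext 89')); // true
console.log(nanpRegex.test('555-1234'));            // false
```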
I was struggling with the same issue, trying to make my application future proof, but these guys got me going in the right direction. I'm not actually checking the number itself to see if it works or not, I'm just trying to make sure that a series of numbers was entered that may or may not have an extension.
Worst case scenario if the user had to pull an unformatted number from the XML file, they would still just type the numbers into the phone's numberpad 012345678x5, no real reason to keep it pretty. That kind of RegEx would come out something like this for me:
\d+ ?\w{0,9} ?\d+
01234467 extension 123456
01234567x123456
01234567890
My inclination is to agree that stripping non-digits and just accepting what's there is best. Maybe to ensure at least a couple digits are present, although that does prohibit something like an alphabetic phone number "ASK-JAKE" for example.
A couple simple perl expressions might be:
@f = /(\d+)/g;
tr/0-9//dc;
Use the first one to keep the digit groups together, which may give formatting clues. Use the second one to trivially toss all non-digits.
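For comparison, roughly equivalent one-liners in JavaScript might look like this. The function names are made up for illustration.

```javascript
// Keep the digit groups together, like the first Perl expression.
function digitGroups(input) {
  return input.match(/\d+/g) || []; // match() returns null when nothing matches
}

// Toss every non-digit, like the second Perl expression.
function digitsOnly(input) {
  return input.replace(/\D/g, '');
}

console.log(digitGroups('(555) 123-4567 x89')); // [ '555', '123', '4567', '89' ]
console.log(digitsOnly('(555) 123-4567 x89'));  // 555123456789
```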
Is it a worry that there may need to be a pause and then more keys entered? Or something like 555-1212 (wait for the beep) 123?
pattern="^[\d|\+|\(]+[\)|\d|\s|-]*[\d]$"
validateat="onsubmit"
Must end with a digit, can begin with ( or + or a digit, and may contain + - ( or )
For anyone interested in doing something similar with Irish mobile phone numbers, here's a straightforward way of accomplishing it:
http://ilovenicii.com/?p=87
PHP
<?php
$pattern = "/^(083|085|086|087)\d{7}$/";
$phone = "087343266";
if (preg_match($pattern,$phone)) echo "Match";
else echo "Not match";
There is also a jQuery solution on that link.
EDIT:
jQuery solution:
$(function(){
//original field values
var field_values = {
//id : value
'url' : 'url',
'yourname' : 'yourname',
'email' : 'email',
'phone' : 'phone'
};
var url =$("input#url").val();
var yourname =$("input#yourname").val();
var email =$("input#email").val();
var phone =$("input#phone").val();
//inputfocus
$('input#url').inputfocus({ value: field_values['url'] });
$('input#yourname').inputfocus({ value: field_values['yourname'] });
$('input#email').inputfocus({ value: field_values['email'] });
$('input#phone').inputfocus({ value: field_values['phone'] });
//reset progress bar
$('#progress').css('width','0');
$('#progress_text').html('0% Complete');
//first_step
$('form').submit(function(){ return false; });
$('#submit_first').click(function(){
//remove classes
$('#first_step input').removeClass('error').removeClass('valid');
//check if inputs aren't empty
var fields = $('#first_step input[type=text]');
var error = 0;
fields.each(function(){
var value = $(this).val();
if( value.length<12 || value==field_values[$(this).attr('id')] ) {
$(this).addClass('error');
$(this).effect("shake", { times:3 }, 50);
error++;
} else {
$(this).addClass('valid');
}
});
if(!error) {
if( $('#password').val() != $('#cpassword').val() ) {
$('#first_step input[type=password]').each(function(){
$(this).removeClass('valid').addClass('error');
$(this).effect("shake", { times:3 }, 50);
});
return false;
} else {
//update progress bar
$('#progress_text').html('33% Complete');
$('#progress').css('width','113px');
//slide steps
$('#first_step').slideUp();
$('#second_step').slideDown();
}
} else return false;
});
//second section
$('#submit_second').click(function(){
//remove classes
$('#second_step input').removeClass('error').removeClass('valid');
var emailPattern = /^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$/;
var fields = $('#second_step input[type=text]');
var error = 0;
fields.each(function(){
var value = $(this).val();
if( value.length<1 || value==field_values[$(this).attr('id')] || ( $(this).attr('id')=='email' && !emailPattern.test(value) ) ) {
$(this).addClass('error');
$(this).effect("shake", { times:3 }, 50);
error++;
} else {
$(this).addClass('valid');
}
function validatePhone(phone) {
var a = document.getElementById(phone).value;
var filter = /^[0-9-+]+$/;
if (filter.test(a)) {
return true;
}
else {
return false;
}
}
$('#phone').blur(function(e) {
if (validatePhone('phone')) { // the phone input in this form has id "phone"
$('#spnPhoneStatus').html('Valid');
$('#spnPhoneStatus').css('color', 'green');
}
else {
$('#spnPhoneStatus').html('Invalid');
$('#spnPhoneStatus').css('color', 'red');
}
});
});
if(!error) {
//update progress bar
$('#progress_text').html('66% Complete');
$('#progress').css('width','226px');
//slide steps
$('#second_step').slideUp();
$('#third_step').slideDown();
} else return false;
});
$('#submit_third').click(function(){
//update progress bar
$('#progress_text').html('100% Complete');
$('#progress').css('width','339px');
//prepare the fourth step
var fields = new Array(
$('#url').val(),
$('#yourname').val(),
$('#email').val(),
$('#phone').val()
);
var tr = $('#fourth_step tr');
tr.each(function(){
//alert( fields[$(this).index()] )
$(this).children('td:nth-child(2)').html(fields[$(this).index()]);
});
//slide steps
$('#third_step').slideUp();
$('#fourth_step').slideDown();
});
$('#submit_fourth').click(function(){
url =$("input#url").val();
yourname =$("input#yourname").val();
email =$("input#email").val();
phone =$("input#phone").val();
//send information to server
var dataString = 'url='+ url + '&yourname=' + yourname + '&email=' + email + '&phone=' + phone;
alert (dataString);//return false;
$.ajax({
type: "POST",
url: "http://clients.socialnetworkingsolutions.com/infobox/contact/",
data: "url="+url+"&yourname="+yourname+"&email="+email+'&phone=' + phone,
cache: false,
success: function(data) {
console.log("form submitted");
alert("success");
}
});
return false;
});
//back button
$('.back').click(function(){
var container = $(this).parent('div'),
previous = container.prev();
switch(previous.attr('id')) {
case 'first_step' : $('#progress_text').html('0% Complete');
$('#progress').css('width','0px');
break;
case 'second_step': $('#progress_text').html('33% Complete');
$('#progress').css('width','113px');
break;
case 'third_step' : $('#progress_text').html('66% Complete');
$('#progress').css('width','226px');
break;
default: break;
}
$(container).slideUp();
$(previous).slideDown();
});
});
Source.

Resources