Does anyone have a good Proper Case algorithm - algorithm

Does anyone have a trusted Proper Case or PCase algorithm (similar to a UCase or Upper)? I'm looking for something that takes a value such as "GEORGE BURDELL" or "george burdell" and turns it into "George Burdell".
I have a simple one that handles the simple cases. The ideal would be to have something that can handle things such as "O'REILLY" and turn it into "O'Reilly", but I know that is tougher.
I am mainly focused on the English language if that simplifies things.
UPDATE: I'm using C# as the language, but I can convert from almost anything (assuming like functionality exists).
I agree that the McDonald's scneario is a tough one. I meant to mention that along with my O'Reilly example, but did not in the original post.

Unless I've misunderstood your question I don't think you need to roll your own, the TextInfo class can do it for you.
using System.Globalization;
CultureInfo.InvariantCulture.TextInfo.ToTitleCase("GeOrGE bUrdEll")
Will return "George Burdell. And you can use your own culture if there's some special rules involved.
Update: Michael (in a comment to this answer) pointed out that this will not work if the input is all caps since the method will assume that it is an acronym. The naive workaround for this is to .ToLower() the text before submitting it to ToTitleCase.

#zwol: I'll post it as a separate reply.
Here's an example based on ljs's post.
void Main()
{
List<string> names = new List<string>() {
"bill o'reilly",
"johannes diderik van der waals",
"mr. moseley-williams",
"Joe VanWyck",
"mcdonald's",
"william the third",
"hrh prince charles",
"h.r.m. queen elizabeth the third",
"william gates, iii",
"pope leo xii",
"a.k. jennings"
};
names.Select(name => name.ToProperCase()).Dump();
}
// http://stackoverflow.com/questions/32149/does-anyone-have-a-good-proper-case-algorithm
public static class ProperCaseHelper
{
public static string ToProperCase(this string input)
{
if (IsAllUpperOrAllLower(input))
{
// fix the ALL UPPERCASE or all lowercase names
return string.Join(" ", input.Split(' ').Select(word => wordToProperCase(word)));
}
else
{
// leave the CamelCase or Propercase names alone
return input;
}
}
public static bool IsAllUpperOrAllLower(this string input)
{
return (input.ToLower().Equals(input) || input.ToUpper().Equals(input));
}
private static string wordToProperCase(string word)
{
if (string.IsNullOrEmpty(word)) return word;
// Standard case
string ret = capitaliseFirstLetter(word);
// Special cases:
ret = properSuffix(ret, "'"); // D'Artagnon, D'Silva
ret = properSuffix(ret, "."); // ???
ret = properSuffix(ret, "-"); // Oscar-Meyer-Weiner
ret = properSuffix(ret, "Mc", t => t.Length > 4); // Scots
ret = properSuffix(ret, "Mac", t => t.Length > 5); // Scots except Macey
// Special words:
ret = specialWords(ret, "van"); // Dick van Dyke
ret = specialWords(ret, "von"); // Baron von Bruin-Valt
ret = specialWords(ret, "de");
ret = specialWords(ret, "di");
ret = specialWords(ret, "da"); // Leonardo da Vinci, Eduardo da Silva
ret = specialWords(ret, "of"); // The Grand Old Duke of York
ret = specialWords(ret, "the"); // William the Conqueror
ret = specialWords(ret, "HRH"); // His/Her Royal Highness
ret = specialWords(ret, "HRM"); // His/Her Royal Majesty
ret = specialWords(ret, "H.R.H."); // His/Her Royal Highness
ret = specialWords(ret, "H.R.M."); // His/Her Royal Majesty
ret = dealWithRomanNumerals(ret); // William Gates, III
return ret;
}
private static string properSuffix(string word, string prefix, Func<string, bool> condition = null)
{
if (string.IsNullOrEmpty(word)) return word;
if (condition != null && ! condition(word)) return word;
string lowerWord = word.ToLower();
string lowerPrefix = prefix.ToLower();
if (!lowerWord.Contains(lowerPrefix)) return word;
int index = lowerWord.IndexOf(lowerPrefix);
// If the search string is at the end of the word ignore.
if (index + prefix.Length == word.Length) return word;
return word.Substring(0, index) + prefix +
capitaliseFirstLetter(word.Substring(index + prefix.Length));
}
private static string specialWords(string word, string specialWord)
{
if (word.Equals(specialWord, StringComparison.InvariantCultureIgnoreCase))
{
return specialWord;
}
else
{
return word;
}
}
private static string dealWithRomanNumerals(string word)
{
// Roman Numeral parser thanks to [djk](https://stackoverflow.com/users/785111/djk)
// Note that it excludes the Chinese last name Xi
return new Regex(#"\b(?!Xi\b)(X|XX|XXX|XL|L|LX|LXX|LXXX|XC|C)?(I|II|III|IV|V|VI|VII|VIII|IX)?\b", RegexOptions.IgnoreCase).Replace(word, match => match.Value.ToUpperInvariant());
}
private static string capitaliseFirstLetter(string word)
{
return char.ToUpper(word[0]) + word.Substring(1).ToLower();
}
}

There's also this neat Perl script for title-casing text.
http://daringfireball.net/2008/08/title_case_update
#!/usr/bin/perl
# This filter changes all words to Title Caps, and attempts to be clever
# about *un*capitalizing small words like a/an/the in the input.
#
# The list of "small words" which are not capped comes from
# the New York Times Manual of Style, plus 'vs' and 'v'.
#
# 10 May 2008
# Original version by John Gruber:
# http://daringfireball.net/2008/05/title_case
#
# 28 July 2008
# Re-written and much improved by Aristotle Pagaltzis:
# http://plasmasturm.org/code/titlecase/
#
# Full change log at __END__.
#
# License: http://www.opensource.org/licenses/mit-license.php
#
use strict;
use warnings;
use utf8;
use open qw( :encoding(UTF-8) :std );
my #small_words = qw( (?<!q&)a an and as at(?!&t) but by en for if in of on or the to v[.]? via vs[.]? );
my $small_re = join '|', #small_words;
my $apos = qr/ (?: ['’] [[:lower:]]* )? /x;
while ( <> ) {
s{\A\s+}{}, s{\s+\z}{};
$_ = lc $_ if not /[[:lower:]]/;
s{
\b (_*) (?:
( (?<=[ ][/\\]) [[:alpha:]]+ [-_[:alpha:]/\\]+ | # file path or
[-_[:alpha:]]+ [#.:] [-_[:alpha:]#.:/]+ $apos ) # URL, domain, or email
|
( (?i: $small_re ) $apos ) # or small word (case-insensitive)
|
( [[:alpha:]] [[:lower:]'’()\[\]{}]* $apos ) # or word w/o internal caps
|
( [[:alpha:]] [[:alpha:]'’()\[\]{}]* $apos ) # or some other word
) (_*) \b
}{
$1 . (
defined $2 ? $2 # preserve URL, domain, or email
: defined $3 ? "\L$3" # lowercase small word
: defined $4 ? "\u\L$4" # capitalize word w/o internal caps
: $5 # preserve other kinds of word
) . $6
}xeg;
# Exceptions for small words: capitalize at start and end of title
s{
( \A [[:punct:]]* # start of title...
| [:.;?!][ ]+ # or of subsentence...
| [ ]['"“‘(\[][ ]* ) # or of inserted subphrase...
( $small_re ) \b # ... followed by small word
}{$1\u\L$2}xig;
s{
\b ( $small_re ) # small word...
(?= [[:punct:]]* \Z # ... at the end of the title...
| ['"’”)\]] [ ] ) # ... or of an inserted subphrase?
}{\u\L$1}xig;
# Exceptions for small words in hyphenated compound words
## e.g. "in-flight" -> In-Flight
s{
\b
(?<! -) # Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (in-flight)
( $small_re )
(?= -[[:alpha:]]+) # lookahead for "-someword"
}{\u\L$1}xig;
## # e.g. "Stand-in" -> "Stand-In" (Stand is already capped at this point)
s{
\b
(?<!…) # Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (stand-in)
( [[:alpha:]]+- ) # $1 = first word and hyphen, should already be properly capped
( $small_re ) # ... followed by small word
(?! - ) # Negative lookahead for another '-'
}{$1\u$2}xig;
print "$_";
}
__END__
But it sounds like by proper case you mean.. for people's names only.

I did a quick C# port of https://github.com/tamtamchik/namecase, which is based on Lingua::EN::NameCase.
public static class CIQNameCase
{
static Dictionary<string, string> _exceptions = new Dictionary<string, string>
{
{#"\bMacEdo" ,"Macedo"},
{#"\bMacEvicius" ,"Macevicius"},
{#"\bMacHado" ,"Machado"},
{#"\bMacHar" ,"Machar"},
{#"\bMacHin" ,"Machin"},
{#"\bMacHlin" ,"Machlin"},
{#"\bMacIas" ,"Macias"},
{#"\bMacIulis" ,"Maciulis"},
{#"\bMacKie" ,"Mackie"},
{#"\bMacKle" ,"Mackle"},
{#"\bMacKlin" ,"Macklin"},
{#"\bMacKmin" ,"Mackmin"},
{#"\bMacQuarie" ,"Macquarie"}
};
static Dictionary<string, string> _replacements = new Dictionary<string, string>
{
{#"\bAl(?=\s+\w)" , #"al"}, // al Arabic or forename Al.
{#"\b(Bin|Binti|Binte)\b" , #"bin"}, // bin, binti, binte Arabic
{#"\bAp\b" , #"ap"}, // ap Welsh.
{#"\bBen(?=\s+\w)" , #"ben"}, // ben Hebrew or forename Ben.
{#"\bDell([ae])\b" , #"dell$1"}, // della and delle Italian.
{#"\bD([aeiou])\b" , #"d$1"}, // da, de, di Italian; du French; do Brasil
{#"\bD([ao]s)\b" , #"d$1"}, // das, dos Brasileiros
{#"\bDe([lrn])\b" , #"de$1"}, // del Italian; der/den Dutch/Flemish.
{#"\bEl\b" , #"el"}, // el Greek or El Spanish.
{#"\bLa\b" , #"la"}, // la French or La Spanish.
{#"\bL([eo])\b" , #"l$1"}, // lo Italian; le French.
{#"\bVan(?=\s+\w)" , #"van"}, // van German or forename Van.
{#"\bVon\b" , #"von"} // von Dutch/Flemish
};
static string[] _conjunctions = { "Y", "E", "I" };
static string _romanRegex = #"\b((?:[Xx]{1,3}|[Xx][Ll]|[Ll][Xx]{0,3})?(?:[Ii]{1,3}|[Ii][VvXx]|[Vv][Ii]{0,3})?)\b";
/// <summary>
/// Case a name field into its appropriate case format
/// e.g. Smith, de la Cruz, Mary-Jane, O'Brien, McTaggart
/// </summary>
/// <param name="nameString"></param>
/// <returns></returns>
public static string NameCase(string nameString)
{
// Capitalize
nameString = Capitalize(nameString);
nameString = UpdateIrish(nameString);
// Fixes for "son (daughter) of" etc
foreach (var replacement in _replacements.Keys)
{
if (Regex.IsMatch(nameString, replacement))
{
Regex rgx = new Regex(replacement);
nameString = rgx.Replace(nameString, _replacements[replacement]);
}
}
nameString = UpdateRoman(nameString);
nameString = FixConjunction(nameString);
return nameString;
}
/// <summary>
/// Capitalize first letters.
/// </summary>
/// <param name="nameString"></param>
/// <returns></returns>
private static string Capitalize(string nameString)
{
nameString = nameString.ToLower();
nameString = Regex.Replace(nameString, #"\b\w", x => x.ToString().ToUpper());
nameString = Regex.Replace(nameString, #"'\w\b", x => x.ToString().ToLower()); // Lowercase 's
return nameString;
}
/// <summary>
/// Update for Irish names.
/// </summary>
/// <param name="nameString"></param>
/// <returns></returns>
private static string UpdateIrish(string nameString)
{
if(Regex.IsMatch(nameString, #".*?\bMac[A-Za-z^aciozj]{2,}\b") || Regex.IsMatch(nameString, #".*?\bMc"))
{
nameString = UpdateMac(nameString);
}
return nameString;
}
/// <summary>
/// Updates irish Mac & Mc.
/// </summary>
/// <param name="nameString"></param>
/// <returns></returns>
private static string UpdateMac(string nameString)
{
MatchCollection matches = Regex.Matches(nameString, #"\b(Ma?c)([A-Za-z]+)");
if(matches.Count == 1 && matches[0].Groups.Count == 3)
{
string replacement = matches[0].Groups[1].Value;
replacement += matches[0].Groups[2].Value.Substring(0, 1).ToUpper();
replacement += matches[0].Groups[2].Value.Substring(1);
nameString = nameString.Replace(matches[0].Groups[0].Value, replacement);
// Now fix "Mac" exceptions
foreach (var exception in _exceptions.Keys)
{
nameString = Regex.Replace(nameString, exception, _exceptions[exception]);
}
}
return nameString;
}
/// <summary>
/// Fix roman numeral names.
/// </summary>
/// <param name="nameString"></param>
/// <returns></returns>
private static string UpdateRoman(string nameString)
{
MatchCollection matches = Regex.Matches(nameString, _romanRegex);
if (matches.Count > 1)
{
foreach(Match match in matches)
{
if(!string.IsNullOrEmpty(match.Value))
{
nameString = Regex.Replace(nameString, match.Value, x => x.ToString().ToUpper());
}
}
}
return nameString;
}
/// <summary>
/// Fix Spanish conjunctions.
/// </summary>
/// <param name=""></param>
/// <returns></returns>
private static string FixConjunction(string nameString)
{
foreach (var conjunction in _conjunctions)
{
nameString = Regex.Replace(nameString, #"\b" + conjunction + #"\b", x => x.ToString().ToLower());
}
return nameString;
}
}
Usage
string name_cased = CIQNameCase.NameCase("McCarthy");
This is my test method, everything seems to pass OK:
[TestMethod]
public void Test_NameCase_1()
{
string[] names = {
"Keith", "Yuri's", "Leigh-Williams", "McCarthy",
// Mac exceptions
"Machin", "Machlin", "Machar",
"Mackle", "Macklin", "Mackie",
"Macquarie", "Machado", "Macevicius",
"Maciulis", "Macias", "MacMurdo",
// General
"O'Callaghan", "St. John", "von Streit",
"van Dyke", "Van", "ap Llwyd Dafydd",
"al Fahd", "Al",
"el Grecco",
"ben Gurion", "Ben",
"da Vinci",
"di Caprio", "du Pont", "de Legate",
"del Crond", "der Sind", "van der Post", "van den Thillart",
"von Trapp", "la Poisson", "le Figaro",
"Mack Knife", "Dougal MacDonald",
"Ruiz y Picasso", "Dato e Iradier", "Mas i Gavarró",
// Roman numerals
"Henry VIII", "Louis III", "Louis XIV",
"Charles II", "Fred XLIX", "Yusof bin Ishak",
};
foreach(string name in names)
{
string name_upper = name.ToUpper();
string name_cased = CIQNameCase.NameCase(name_upper);
Console.WriteLine(string.Format("name: {0} -> {1} -> {2}", name, name_upper, name_cased));
Assert.IsTrue(name == name_cased);
}
}

I wrote this today to implement in an app I'm working on. I think this code is pretty self explanatory with comments. It's not 100% accurate in all cases but it will handle most of your western names easily.
Examples:
mary-jane => Mary-Jane
o'brien => O'Brien
Joël VON WINTEREGG => Joël von Winteregg
jose de la acosta => Jose de la Acosta
The code is extensible in that you may add any string value to the arrays at the top to suit your needs. Please study it and add any special feature that may be required.
function name_title_case($str)
{
// name parts that should be lowercase in most cases
$ok_to_be_lower = array('av','af','da','dal','de','del','der','di','la','le','van','der','den','vel','von');
// name parts that should be lower even if at the beginning of a name
$always_lower = array('van', 'der');
// Create an array from the parts of the string passed in
$parts = explode(" ", mb_strtolower($str));
foreach ($parts as $part)
{
(in_array($part, $ok_to_be_lower)) ? $rules[$part] = 'nocaps' : $rules[$part] = 'caps';
}
// Determine the first part in the string
reset($rules);
$first_part = key($rules);
// Loop through and cap-or-dont-cap
foreach ($rules as $part => $rule)
{
if ($rule == 'caps')
{
// ucfirst() words and also takes into account apostrophes and hyphens like this:
// O'brien -> O'Brien || mary-kaye -> Mary-Kaye
$part = str_replace('- ','-',ucwords(str_replace('-','- ', $part)));
$c13n[] = str_replace('\' ', '\'', ucwords(str_replace('\'', '\' ', $part)));
}
else if ($part == $first_part && !in_array($part, $always_lower))
{
// If the first part of the string is ok_to_be_lower, cap it anyway
$c13n[] = ucfirst($part);
}
else
{
$c13n[] = $part;
}
}
$titleized = implode(' ', $c13n);
return trim($titleized);
}

What programming language do you use? Many languages allow callback functions for regular expression matches. These can be used to propercase the match easily. The regular expression that would be used is quite simple, you just have to match all word characters, like so:
/\w+/
Alternatively, you can already extract the first character to be an extra match:
/(\w)(\w*)/
Now you can access the first character and successive characters in the match separately. The callback function can then simply return a concatenation of the hits. In pseudo Python (I don't actually know Python):
def make_proper(match):
return match[1].to_upper + match[2]
Incidentally, this would also handle the case of “O'Reilly” because “O” and “Reilly” would be matched separately and both propercased. There are however other special cases that are not handled well by the algorithm, e.g. “McDonald's” or generally any apostrophed word. The algorithm would produce “Mcdonald'S” for the latter. A special handling for apostrophe could be implemented but that would interfere with the first case. Finding a thereotical perfect solution isn't possible. In practice, it might help considering the length of the part after the apostrophe.

Here's a perhaps naive C# implementation:-
public class ProperCaseHelper {
public string ToProperCase(string input) {
string ret = string.Empty;
var words = input.Split(' ');
for (int i = 0; i < words.Length; ++i) {
ret += wordToProperCase(words[i]);
if (i < words.Length - 1) ret += " ";
}
return ret;
}
private string wordToProperCase(string word) {
if (string.IsNullOrEmpty(word)) return word;
// Standard case
string ret = capitaliseFirstLetter(word);
// Special cases:
ret = properSuffix(ret, "'");
ret = properSuffix(ret, ".");
ret = properSuffix(ret, "Mc");
ret = properSuffix(ret, "Mac");
return ret;
}
private string properSuffix(string word, string prefix) {
if(string.IsNullOrEmpty(word)) return word;
string lowerWord = word.ToLower(), lowerPrefix = prefix.ToLower();
if (!lowerWord.Contains(lowerPrefix)) return word;
int index = lowerWord.IndexOf(lowerPrefix);
// If the search string is at the end of the word ignore.
if (index + prefix.Length == word.Length) return word;
return word.Substring(0, index) + prefix +
capitaliseFirstLetter(word.Substring(index + prefix.Length));
}
private string capitaliseFirstLetter(string word) {
return char.ToUpper(word[0]) + word.Substring(1).ToLower();
}
}

I know this thread has been open for awhile, but as I was doing research for this problem I came across this nifty site, which allows you to paste in names to be capitalized quite quickly: https://dialect.ca/code/name-case/. I wanted to include it here for reference for others doing similar research/projects.
They release the algorithm they have written in php at this link: https://dialect.ca/code/name-case/name_case.phps
A preliminary test and reading of their code suggests they have been quite thorough.

a simple way to capitalise the first letter of each word (seperated by a space)
$words = explode(” “, $string);
for ($i=0; $i<count($words); $i++) {
$s = strtolower($words[$i]);
$s = substr_replace($s, strtoupper(substr($s, 0, 1)), 0, 1);
$result .= “$s “;
}
$string = trim($result);
in terms of catching the "O'REILLY" example you gave
splitting the string on both spaces and ' would not work as it would capitalise any letter that appeared after a apostraphe i.e. the s in Fred's
so i would probably try something like
$words = explode(” “, $string);
for ($i=0; $i<count($words); $i++) {
$s = strtolower($words[$i]);
if (substr($s, 0, 2) === "o'"){
$s = substr_replace($s, strtoupper(substr($s, 0, 3)), 0, 3);
}else{
$s = substr_replace($s, strtoupper(substr($s, 0, 1)), 0, 1);
}
$result .= “$s “;
}
$string = trim($result);
This should catch O'Reilly, O'Clock, O'Donnell etc hope it helps
Please note this code is untested.

Kronoz, thank you. I found in your function that the line:
`if (!lowerWord.Contains(lowerPrefix)) return word`;
must say
if (!lowerWord.StartsWith(lowerPrefix)) return word;
so "información" is not changed to "InforMacIón"
best,
Enrique

I use this as the textchanged event handler of text boxes. Support entry of "McDonald"
Public Shared Function DoProperCaseConvert(ByVal str As String, Optional ByVal allowCapital As Boolean = True) As String
Dim strCon As String = ""
Dim wordbreak As String = " ,.1234567890;/\-()#$%^&*€!~+=#"
Dim nextShouldBeCapital As Boolean = True
'Improve to recognize all caps input
'If str.Equals(str.ToUpper) Then
' str = str.ToLower
'End If
For Each s As Char In str.ToCharArray
If allowCapital Then
strCon = strCon & If(nextShouldBeCapital, s.ToString.ToUpper, s)
Else
strCon = strCon & If(nextShouldBeCapital, s.ToString.ToUpper, s.ToLower)
End If
If wordbreak.Contains(s.ToString) Then
nextShouldBeCapital = True
Else
nextShouldBeCapital = False
End If
Next
Return strCon
End Function

A lot of good answers here. Mine is pretty simple and only takes into account the names we have in our organization. You can expand it as you wish. This is not a perfect solution and will change vancouver to VanCouver, which is wrong. So tweak it if you use it.
Here was my solution in C#. This hard-codes the names into the program but with a little work you could keep a text file outside of the program and read in the name exceptions (i.e. Van, Mc, Mac) and loop through them.
public static String toProperName(String name)
{
if (name != null)
{
if (name.Length >= 2 && name.ToLower().Substring(0, 2) == "mc") // Changes mcdonald to "McDonald"
return "Mc" + Regex.Replace(name.ToLower().Substring(2), #"\b[a-z]", m => m.Value.ToUpper());
if (name.Length >= 3 && name.ToLower().Substring(0, 3) == "van") // Changes vanwinkle to "VanWinkle"
return "Van" + Regex.Replace(name.ToLower().Substring(3), #"\b[a-z]", m => m.Value.ToUpper());
return Regex.Replace(name.ToLower(), #"\b[a-z]", m => m.Value.ToUpper()); // Changes to title case but also fixes
// appostrophes like O'HARE or o'hare to O'Hare
}
return "";
}

You do not mention which language you would like the solution in so here is some pseudo code.
Loop through each character
If the previous character was an alphabet letter
Make the character lower case
Otherwise
Make the character upper case
End loop

Related

Using backslashes in the windows cmd [duplicate]

Is the following behaviour some feature or a bug in C# .NET?
Test application:
using System;
using System.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Arguments:");
foreach (string arg in args)
{
Console.WriteLine(arg);
}
Console.WriteLine();
Console.WriteLine("Command Line:");
var clArgs = Environment.CommandLine.Split(' ');
foreach (string arg in clArgs.Skip(clArgs.Length - args.Length))
{
Console.WriteLine(arg);
}
Console.ReadKey();
}
}
}
Run it with command line arguments:
a "b" "\\x\\" "\x\"
In the result I receive:
Arguments:
a
b
\\x\
\x"
Command Line:
a
"b"
"\\x\\"
"\x\"
There are missing backslashes and non-removed quote in args passed to method Main(). What is the correct workaround except manually parsing Environment.CommandLine?
According to this article by Jon Galloway, there can be weird behaviour experienced when using backslashes in command line arguments.
Most notably it mentions that "Most applications (including .NET applications) use CommandLineToArgvW to decode their command lines. It uses crazy escaping rules which explain the behaviour you're seeing."
It explains that the first set of backslashes do not require escaping, but backslashes coming after alpha (maybe numeric too?) characters require escaping and that quotes always need to be escaped.
Based off of these rules, I believe to get the arguments you want you would have to pass them as:
a "b" "\\x\\\\" "\x\\"
"Whacky" indeed.
The full story of the crazy escaping rules was told in 2011 by an MS blog entry: "Everyone quotes command line arguments the wrong way"
Raymond also had something to say on the matter (already back in 2010): "What's up with the strange treatment of quotation marks and backslashes by CommandLineToArgvW"
The situation persists into 2020 and the escaping rules described in Everyone quotes command line arguments the wrong way are still correct as of 2020 and Windows 10.
I came across this same issue the other day and had a tough time getting through it. In my googling, I came across this article regarding VB.NET (the language of my application) that solved the problem without having to change any of my other code based on the arguments.
In that article, he refers to the original article which was written for C#. Here's the actual code, you pass it Environment.CommandLine():
C#
class CommandLineTools
{
/// <summary>
/// C-like argument parser
/// </summary>
/// <param name="commandLine">Command line string with arguments. Use Environment.CommandLine</param>
/// <returns>The args[] array (argv)</returns>
public static string[] CreateArgs(string commandLine)
{
StringBuilder argsBuilder = new StringBuilder(commandLine);
bool inQuote = false;
// Convert the spaces to a newline sign so we can split at newline later on
// Only convert spaces which are outside the boundries of quoted text
for (int i = 0; i < argsBuilder.Length; i++)
{
if (argsBuilder[i].Equals('"'))
{
inQuote = !inQuote;
}
if (argsBuilder[i].Equals(' ') && !inQuote)
{
argsBuilder[i] = '\n';
}
}
// Split to args array
string[] args = argsBuilder.ToString().Split(new char[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
// Clean the '"' signs from the args as needed.
for (int i = 0; i < args.Length; i++)
{
args[i] = ClearQuotes(args[i]);
}
return args;
}
/// <summary>
/// Cleans quotes from the arguments.<br/>
/// All signle quotes (") will be removed.<br/>
/// Every pair of quotes ("") will transform to a single quote.<br/>
/// </summary>
/// <param name="stringWithQuotes">A string with quotes.</param>
/// <returns>The same string if its without quotes, or a clean string if its with quotes.</returns>
private static string ClearQuotes(string stringWithQuotes)
{
int quoteIndex;
if ((quoteIndex = stringWithQuotes.IndexOf('"')) == -1)
{
// String is without quotes..
return stringWithQuotes;
}
// Linear sb scan is faster than string assignemnt if quote count is 2 or more (=always)
StringBuilder sb = new StringBuilder(stringWithQuotes);
for (int i = quoteIndex; i < sb.Length; i++)
{
if (sb[i].Equals('"'))
{
// If we are not at the last index and the next one is '"', we need to jump one to preserve one
if (i != sb.Length - 1 && sb[i + 1].Equals('"'))
{
i++;
}
// We remove and then set index one backwards.
// This is because the remove itself is going to shift everything left by 1.
sb.Remove(i--, 1);
}
}
return sb.ToString();
}
}
VB.NET:
Imports System.Text
' Original version by Jonathan Levison (C#)'
' http://sleepingbits.com/2010/01/command-line-arguments-with-double-quotes-in-net/
' converted using http://www.developerfusion.com/tools/convert/csharp-to-vb/
' and then some manual effort to fix language discrepancies
Friend Class CommandLineHelper
''' <summary>
''' C-like argument parser
''' </summary>
''' <param name="commandLine">Command line string with arguments. Use Environment.CommandLine</param>
''' <returns>The args[] array (argv)</returns>
Public Shared Function CreateArgs(commandLine As String) As String()
Dim argsBuilder As New StringBuilder(commandLine)
Dim inQuote As Boolean = False
' Convert the spaces to a newline sign so we can split at newline later on
' Only convert spaces which are outside the boundries of quoted text
For i As Integer = 0 To argsBuilder.Length - 1
If argsBuilder(i).Equals(""""c) Then
inQuote = Not inQuote
End If
If argsBuilder(i).Equals(" "c) AndAlso Not inQuote Then
argsBuilder(i) = ControlChars.Lf
End If
Next
' Split to args array
Dim args As String() = argsBuilder.ToString().Split(New Char() {ControlChars.Lf}, StringSplitOptions.RemoveEmptyEntries)
' Clean the '"' signs from the args as needed.
For i As Integer = 0 To args.Length - 1
args(i) = ClearQuotes(args(i))
Next
Return args
End Function
''' <summary>
''' Cleans quotes from the arguments.<br/>
''' All signle quotes (") will be removed.<br/>
''' Every pair of quotes ("") will transform to a single quote.<br/>
''' </summary>
''' <param name="stringWithQuotes">A string with quotes.</param>
''' <returns>The same string if its without quotes, or a clean string if its with quotes.</returns>
Private Shared Function ClearQuotes(stringWithQuotes As String) As String
Dim quoteIndex As Integer = stringWithQuotes.IndexOf(""""c)
If quoteIndex = -1 Then Return stringWithQuotes
' Linear sb scan is faster than string assignemnt if quote count is 2 or more (=always)
Dim sb As New StringBuilder(stringWithQuotes)
Dim i As Integer = quoteIndex
Do While i < sb.Length
If sb(i).Equals(""""c) Then
' If we are not at the last index and the next one is '"', we need to jump one to preserve one
If i <> sb.Length - 1 AndAlso sb(i + 1).Equals(""""c) Then
i += 1
End If
' We remove and then set index one backwards.
' This is because the remove itself is going to shift everything left by 1.
sb.Remove(System.Math.Max(System.Threading.Interlocked.Decrement(i), i + 1), 1)
End If
i += 1
Loop
Return sb.ToString()
End Function
End Class
I have escaped the problem the other way...
Instead of getting arguments already parsed I am getting the arguments string as it is and then I am using my own parser:
static void Main(string[] args)
{
var param = ParseString(Environment.CommandLine);
...
}
// The following template implements the following notation:
// -key1 = some value -key2 = "some value even with '-' character " ...
private const string ParameterQuery = "\\-(?<key>\\w+)\\s*=\\s*(\"(?<value>[^\"]*)\"|(?<value>[^\\-]*))\\s*";
private static Dictionary<string, string> ParseString(string value)
{
var regex = new Regex(ParameterQuery);
return regex.Matches(value).Cast<Match>().ToDictionary(m => m.Groups["key"].Value, m => m.Groups["value"].Value);
}
This concept lets you type quotes without the escape prefix.
After much experimentation this worked for me. I'm trying to create a command to send to the Windows command line. A folder name comes after the -graphical option in the command, and since it may have spaces in it, it has to be wrapped in double quotes. When I used back slashes to create the quotes they came out as literals in the command. So this. . . .
string q = #"" + (char) 34;
string strCmdText = string.Format(#"/C cleartool update -graphical {1}{0}{1}", this.txtViewFolder.Text, q);
System.Diagnostics.Process.Start("CMD.exe", strCmdText);
q is a string holding just a double quote character. It's preceded with # to make it a verbatim string literal.
The command template is also a verbatim string literal, and the string.Format method is used to compile everything into strCmdText.
This works for me, and it works correctly with the example in the question.
/// <summary>
/// https://www.pinvoke.net/default.aspx/shell32/CommandLineToArgvW.html
/// </summary>
/// <param name="unsplitArgumentLine"></param>
/// <returns></returns>
static string[] SplitArgs(string unsplitArgumentLine)
{
int numberOfArgs;
IntPtr ptrToSplitArgs;
string[] splitArgs;
ptrToSplitArgs = CommandLineToArgvW(unsplitArgumentLine, out numberOfArgs);
// CommandLineToArgvW returns NULL upon failure.
if (ptrToSplitArgs == IntPtr.Zero)
throw new ArgumentException("Unable to split argument.", new Win32Exception());
// Make sure the memory ptrToSplitArgs to is freed, even upon failure.
try
{
splitArgs = new string[numberOfArgs];
// ptrToSplitArgs is an array of pointers to null terminated Unicode strings.
// Copy each of these strings into our split argument array.
for (int i = 0; i < numberOfArgs; i++)
splitArgs[i] = Marshal.PtrToStringUni(
Marshal.ReadIntPtr(ptrToSplitArgs, i * IntPtr.Size));
return splitArgs;
}
finally
{
// Free memory obtained by CommandLineToArgW.
LocalFree(ptrToSplitArgs);
}
}
[DllImport("shell32.dll", SetLastError = true)]
static extern IntPtr CommandLineToArgvW(
[MarshalAs(UnmanagedType.LPWStr)] string lpCmdLine,
out int pNumArgs);
[DllImport("kernel32.dll")]
static extern IntPtr LocalFree(IntPtr hMem);
static string Reverse(string s)
{
char[] charArray = s.ToCharArray();
Array.Reverse(charArray);
return new string(charArray);
}
static string GetEscapedCommandLine()
{
StringBuilder sb = new StringBuilder();
bool gotQuote = false;
foreach (var c in Environment.CommandLine.Reverse())
{
if (c == '"')
gotQuote = true;
else if (gotQuote && c == '\\')
{
// double it
sb.Append('\\');
}
else
gotQuote = false;
sb.Append(c);
}
return Reverse(sb.ToString());
}
static void Main(string[] args)
{
// Crazy hack
args = SplitArgs(GetEscapedCommandLine()).Skip(1).ToArray();
}

splitting up the contents of a single line

I just went through a problem, where input is a string which is a single word.
This line is not readable,
Like, I want to leave is written as Iwanttoleave.
The problem is of separating out each of the tokens(words, numbers, abbreviations, etc)
I have no idea where to start
The first thought that came to my mind is making a dictionary and then mapping accordingly but I think making a dictionary is not at all a good idea.
Can anyone suggest some algorithm to do it ?
First of all, create a dictionary which helps you to identify if some string is a valid word or not.
bool isValidString(String s){
if(dictionary.contains(s))
return true;
return false;
}
Now, you can write a recursive code to split the string and create an array of actually useful words.
ArrayList usefulWords = new ArrayList<String>; //global declaration
void split(String s){
int l = s.length();
int i,j;
for(i = l-1; i >= 0; i--){
if(isValidString(s.substr(i,l)){ //s.substr(i,l) will return substring starting from index `i` and ending at `l-1`
usefulWords.add(s.substr(i,l));
split(s.substr(0,i));
}
}
}
Now, use these usefulWords to generate all possible strings. Maybe something like this:
ArrayList<String> splits = new ArrayList<String>[10]; //assuming max 10 possible outputs
ArrayList<String>[] allPossibleStrings(String s, int level){
for(int i = 0; i < s.length(); i++){
if(usefulWords.contains(s.substr(0,i)){
splits[level].add(s.substr(0,i));
allPossibleStrings(s.substr(i,s.length()),level);
level++;
}
}
}
Now, this code gives you all possible splits in a somewhat arbitrary manner. eg.
dictionary = {cat, dog, i, am, pro, gram, program, programmer, grammer}
input:
string = program
output:
splits[0] = {pro, gram}
splits[1] = {program}
input:
string = iamprogram
output:
splits[0] = {i, am, pro, gram} //since `mer` is not in dictionary
splits[1] = {program}
I did not give much thought to the last part, but I think you should be able to formulate a code from there as per your requirement.
Also, since no language is tagged, I've taken the liberty of writing the code in JAVA-like syntax as it is really easy to understand.
Instead of using a Dictionary, I'd suggest you use a Trie with all your valid words (the whole English dictionary?). Then you can start moving one letter at a time in your input line and the trie at the same time. If the letter leads to more results in the trie, you can continue expanding the current word, and if not, you can start looking for a new word in the trie.
This won't be a forward only search for sure, so you'll need some sort of backtracking.
// This method Generates a list with all the matching phrases for the given input
List<string> CandidatePhrases(string input) {
Trie validWords = BuildTheTrieWithAllValidWords();
List<string> currentWords = new List<string>();
List<string> possiblePhrases = new List<string>();
// The root of the trie has an empty key that points to all the first letters of all words
Trie currentWord = validWords;
int currentLetter = -1;
// Calls a backtracking method that creates all possible phrases
FindPossiblePhrases(input, validWords, currentWords, currentWord, currentLetter, possiblePhrases);
return possiblePhrases;
}
// The Trie structure could be something like
class Trie {
char key;
bool valid;
List<Trie> children;
Trie parent;
Trie Next(char nextLetter) {
return children.FirstOrDefault(c => c.key == nextLetter);
}
string WholeWord() {
Debug.Assert(valid);
string word = "";
Trie current = this;
while (current.Key != '\0')
{
word = current.Key + word;
current = current.parent;
}
}
}
void FindPossiblePhrases(string input, Trie validWords, List<string> currentWords, Trie currentWord, int currentLetter, List<string> possiblePhrases) {
if (currentLetter == input.Length - 1) {
if (currentWord.valid) {
string phrase = ""
foreach (string word in currentWords) {
phrase += word;
phrase += " ";
}
phrase += currentWord.WholeWord();
possiblePhrases.Add(phrase);
}
}
else {
// The currentWord may be a valid word. If that's the case, the next letter could be the first of a new word, or could be the next letter of a bigger word that begins with currentWord
if (currentWord.valid) {
// Try to match phrases when the currentWord is a valid word
currentWords.Add(currentWord.WholeWord());
FindPossiblePhrases(input, validWords, currentWords, validWords, currentLetter, possiblePhrases);
currentWords.RemoveAt(currentWords.Length - 1);
}
// If either the currentWord is a valid word, or not, try to match a longer word that begins with current word
int nextLetter = currentLetter + 1;
Trie nextWord = currentWord.Next(input[nextLetter]);
// If the nextWord is null, there was no matching word that begins with currentWord and has input[nextLetter] as the following letter.
if (nextWord != null) {
FindPossiblePhrases(input, validWords, currentWords, nextWord, nextLetter, possiblePhrases);
}
}
}

Processing: create an array of the characters within the string

I am new to processing and trying to figure out a way to create an array of all the characters within a string. Currently I Have:
String[] words = {"hello", "devak", "road", "duck", "face"};
String theWord = words[int(random(0,words.length))];
I've been googling and haven't found a good solution yet. Thanks in advance.
In addition to the comment you posted (which perhaps should have been an answer), there are a ton of ways to split a String.
The most obvious solution might be the String.split() function. If you give that function an empty String "" as an argument, it will split every character:
void setup() {
String myString = "testing testing 123";
String[] chars = myString.split("");
for (String c : chars) {
println(c);
}
}
You could also just use the String.charAt() function:
void setup() {
String myString = "testing testing 123";
for (int i = 0; i < myString.length(); i++) {
char c = myString.charAt(i);
println(c);
}
}

Algorithm to generate all variants of a word

i would like to explain my problem by the following example.
assume the word: abc
a has variants: ä, à
b has no variants.
c has variants: ç
so the possible words are:
abc
äbc
àbc
abç
äbç
àbç
now i am looking for the algorithm that prints all word variantions for abritray words with arbitray lettervariants.
I would recommend you to solve this recursively. Here's some Java code for you to get started:
static Map<Character, char[]> variants = new HashMap<Character, char[]>() {{
put('a', new char[] {'ä', 'à'});
put('b', new char[] { });
put('c', new char[] { 'ç' });
}};
public static Set<String> variation(String str) {
Set<String> result = new HashSet<String>();
if (str.isEmpty()) {
result.add("");
return result;
}
char c = str.charAt(0);
for (String tailVariant : variation(str.substring(1))) {
result.add(c + tailVariant);
for (char variant : variants.get(c))
result.add(variant + tailVariant);
}
return result;
}
Test:
public static void main(String[] args) {
for (String str : variation("abc"))
System.out.println(str);
}
Output:
abc
àbç
äbc
àbc
äbç
abç
A quickly hacked solution in Python:
def word_variants(variants):
print_variants("", 1, variants);
def print_variants(word, i, variants):
if i > len(variants):
print word
else:
for variant in variants[i]:
print_variants(word + variant, i + 1, variants)
variants = dict()
variants[1] = ['a0', 'a1', 'a2']
variants[2] = ['b0']
variants[3] = ['c0', 'c1']
word_variants(variants)
Common part:
string[] letterEquiv = { "aäà", "b", "cç", "d", "eèé" };
// Here we make a dictionary where the key is the "base" letter and the value is an array of alternatives
var lookup = letterEquiv
.Select(p => p.ToCharArray())
.SelectMany(p => p, (p, q) => new { key = q, values = p }).ToDictionary(p => p.key, p => p.values);
A recursive variation written in C#.
List<string> resultsRecursive = new List<string>();
// I'm using an anonymous method that "closes" around resultsRecursive and lookup. You could make it a standard method that accepts as a parameter the two.
// Recursive anonymous methods must be declared in this way in C#. Nothing to see.
Action<string, int, char[]> recursive = null;
recursive = (str, ix, str2) =>
{
// In the first loop str2 is null, so we create the place where the string will be built.
if (str2 == null)
{
str2 = new char[str.Length];
}
// The possible variations for the current character
var equivs = lookup[str[ix]];
// For each variation
foreach (var eq in equivs)
{
// We save the current variation for the current character
str2[ix] = eq;
// If we haven't reached the end of the string
if (ix < str.Length - 1)
{
// We recurse, increasing the index
recursive(str, ix + 1, str2);
}
else
{
// We save the string
resultsRecursive.Add(new string(str2));
}
}
};
// We launch our function
recursive("abcdeabcde", 0, null);
// The results are in resultsRecursive
A non-recursive version
List<string> resultsNonRecursive = new List<string>();
// I'm using an anonymous method that "closes" around resultsNonRecursive and lookup. You could make it a standard method that accepts as a parameter the two.
Action<string> nonRecursive = (str) =>
{
// We will have two arrays, of the same length of the string. One will contain
// the possible variations for that letter, the other will contain the "current"
// "chosen" variation of that letter
char[][] equivs = new char[str.Length][];
int[] ixes = new int[str.Length];
for (int i = 0; i < ixes.Length; i++)
{
// We start with index -1 so that the first increase will bring it to 0
equivs[i] = lookup[str[i]];
ixes[i] = -1;
}
// The current "workin" index of the original string
int ix = 0;
// The place where the string will be built.
char[] str2 = new char[str.Length];
// The loop will break when we will have to increment the letter with index -1
while (ix >= 0)
{
// We select the next possible variation for the current character
ixes[ix]++;
// If we have exausted the possible variations of the current character
if (ixes[ix] == equivs[ix].Length)
{
// Reset the current character to -1
ixes[ix] = -1;
// And loop back to the previous character
ix--;
continue;
}
// We save the current variation for the current character
str2[ix] = equivs[ix][ixes[ix]];
// If we are setting the last character of the string, then the string
// is complete
if (ix == str.Length - 1)
{
// And we save it
resultsNonRecursive.Add(new string(str2));
}
else
{
// Otherwise we have to do everything for the next character
ix++;
}
}
};
// We launch our function
nonRecursive("abcdeabcde");
// The results are in resultsNonRecursive
Both heavily commented.

How do I create a URL shortener? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I want to create a URL shortener service where you can write a long URL into an input field and the service shortens the URL to "http://www.example.org/abcdef".
Instead of "abcdef" there can be any other string with six characters containing a-z, A-Z and 0-9. That makes 56~57 billion possible strings.
My approach:
I have a database table with three columns:
id, integer, auto-increment
long, string, the long URL the user entered
short, string, the shortened URL (or just the six characters)
I would then insert the long URL into the table. Then I would select the auto-increment value for "id" and build a hash of it. This hash should then be inserted as "short". But what sort of hash should I build? Hash algorithms like MD5 create too long strings. I don't use these algorithms, I think. A self-built algorithm will work, too.
My idea:
For "http://www.google.de/" I get the auto-increment id 239472. Then I do the following steps:
short = '';
if divisible by 2, add "a"+the result to short
if divisible by 3, add "b"+the result to short
... until I have divisors for a-z and A-Z.
That could be repeated until the number isn't divisible any more. Do you think this is a good approach? Do you have a better idea?
Due to the ongoing interest in this topic, I've published an efficient solution to GitHub, with implementations for JavaScript, PHP, Python and Java. Add your solutions if you like :)
I would continue your "convert number to string" approach. However, you will realize that your proposed algorithm fails if your ID is a prime and greater than 52.
Theoretical background
You need a Bijective Function f. This is necessary so that you can find a inverse function g('abc') = 123 for your f(123) = 'abc' function. This means:
There must be no x1, x2 (with x1 ≠ x2) that will make f(x1) = f(x2),
and for every y you must be able to find an x so that f(x) = y.
How to convert the ID to a shortened URL
Think of an alphabet we want to use. In your case, that's [a-zA-Z0-9]. It contains 62 letters.
Take an auto-generated, unique numerical key (the auto-incremented id of a MySQL table for example).
For this example, I will use 12510 (125 with a base of 10).
Now you have to convert 12510 to X62 (base 62).
12510 = 2×621 + 1×620 = [2,1]
This requires the use of integer division and modulo. A pseudo-code example:
digits = []
while num > 0
remainder = modulo(num, 62)
digits.push(remainder)
num = divide(num, 62)
digits = digits.reverse
Now map the indices 2 and 1 to your alphabet. This is how your mapping (with an array for example) could look like:
0 → a
1 → b
...
25 → z
...
52 → 0
61 → 9
With 2 → c and 1 → b, you will receive cb62 as the shortened URL.
http://shor.ty/cb
How to resolve a shortened URL to the initial ID
The reverse is even easier. You just do a reverse lookup in your alphabet.
e9a62 will be resolved to "4th, 61st, and 0th letter in the alphabet".
e9a62 = [4,61,0] = 4×622 + 61×621 + 0×620 = 1915810
Now find your database-record with WHERE id = 19158 and do the redirect.
Example implementations (provided by commenters)
C++
Python
Ruby
Haskell
C#
CoffeeScript
Perl
Why would you want to use a hash?
You can just use a simple translation of your auto-increment value to an alphanumeric value. You can do that easily by using some base conversion. Say you character space (A-Z, a-z, 0-9, etc.) has 62 characters, convert the id to a base-40 number and use the characters as the digits.
public class UrlShortener {
private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
private static final int BASE = ALPHABET.length();
public static String encode(int num) {
StringBuilder sb = new StringBuilder();
while ( num > 0 ) {
sb.append( ALPHABET.charAt( num % BASE ) );
num /= BASE;
}
return sb.reverse().toString();
}
public static int decode(String str) {
int num = 0;
for ( int i = 0; i < str.length(); i++ )
num = num * BASE + ALPHABET.indexOf(str.charAt(i));
return num;
}
}
Not an answer to your question, but I wouldn't use case-sensitive shortened URLs. They are hard to remember, usually unreadable (many fonts render 1 and l, 0 and O and other characters very very similar that they are near impossible to tell the difference) and downright error prone. Try to use lower or upper case only.
Also, try to have a format where you mix the numbers and characters in a predefined form. There are studies that show that people tend to remember one form better than others (think phone numbers, where the numbers are grouped in a specific form). Try something like num-char-char-num-char-char. I know this will lower the combinations, especially if you don't have upper and lower case, but it would be more usable and therefore useful.
My approach: Take the Database ID, then Base36 Encode it. I would NOT use both Upper AND Lowercase letters, because that makes transmitting those URLs over the telephone a nightmare, but you could of course easily extend the function to be a base 62 en/decoder.
Here is my PHP 5 class.
<?php
class Bijective
{
public $dictionary = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
public function __construct()
{
$this->dictionary = str_split($this->dictionary);
}
public function encode($i)
{
if ($i == 0)
return $this->dictionary[0];
$result = '';
$base = count($this->dictionary);
while ($i > 0)
{
$result[] = $this->dictionary[($i % $base)];
$i = floor($i / $base);
}
$result = array_reverse($result);
return join("", $result);
}
public function decode($input)
{
$i = 0;
$base = count($this->dictionary);
$input = str_split($input);
foreach($input as $char)
{
$pos = array_search($char, $this->dictionary);
$i = $i * $base + $pos;
}
return $i;
}
}
A Node.js and MongoDB solution
Since we know the format that MongoDB uses to create a new ObjectId with 12 bytes.
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id
a 3-byte counter (in your machine), starting with a random value.
Example (I choose a random sequence)
a1b2c3d4e5f6g7h8i9j1k2l3
a1b2c3d4 represents the seconds since the Unix epoch,
4e5f6g7 represents machine identifier,
h8i9 represents process id
j1k2l3 represents the counter, starting with a random value.
Since the counter will be unique if we are storing the data in the same machine we can get it with no doubts that it will be duplicate.
So the short URL will be the counter and here is a code snippet assuming that your server is running properly.
const mongoose = require('mongoose');
const Schema = mongoose.Schema;
// Create a schema
const shortUrl = new Schema({
long_url: { type: String, required: true },
short_url: { type: String, required: true, unique: true },
});
const ShortUrl = mongoose.model('ShortUrl', shortUrl);
// The user can request to get a short URL by providing a long URL using a form
app.post('/shorten', function(req ,res){
// Create a new shortUrl */
// The submit form has an input with longURL as its name attribute.
const longUrl = req.body["longURL"];
const newUrl = ShortUrl({
long_url : longUrl,
short_url : "",
});
const shortUrl = newUrl._id.toString().slice(-6);
newUrl.short_url = shortUrl;
console.log(newUrl);
newUrl.save(function(err){
console.log("the new URL is added");
})
});
I keep incrementing an integer sequence per domain in the database and use Hashids to encode the integer into a URL path.
static hashids = Hashids(salt = "my app rocks", minSize = 6)
I ran a script to see how long it takes until it exhausts the character length. For six characters it can do 164,916,224 links and then goes up to seven characters. Bitly uses seven characters. Under five characters looks weird to me.
Hashids can decode the URL path back to a integer but a simpler solution is to use the entire short link sho.rt/ka8ds3 as a primary key.
Here is the full concept:
function addDomain(domain) {
table("domains").insert("domain", domain, "seq", 0)
}
function addURL(domain, longURL) {
seq = table("domains").where("domain = ?", domain).increment("seq")
shortURL = domain + "/" + hashids.encode(seq)
table("links").insert("short", shortURL, "long", longURL)
return shortURL
}
// GET /:hashcode
function handleRequest(req, res) {
shortURL = req.host + "/" + req.param("hashcode")
longURL = table("links").where("short = ?", shortURL).get("long")
res.redirect(301, longURL)
}
C# version:
public class UrlShortener
{
private static String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
private static int BASE = 62;
public static String encode(int num)
{
StringBuilder sb = new StringBuilder();
while ( num > 0 )
{
sb.Append( ALPHABET[( num % BASE )] );
num /= BASE;
}
StringBuilder builder = new StringBuilder();
for (int i = sb.Length - 1; i >= 0; i--)
{
builder.Append(sb[i]);
}
return builder.ToString();
}
public static int decode(String str)
{
int num = 0;
for ( int i = 0, len = str.Length; i < len; i++ )
{
num = num * BASE + ALPHABET.IndexOf( str[(i)] );
}
return num;
}
}
You could hash the entire URL, but if you just want to shorten the id, do as marcel suggested. I wrote this Python implementation:
https://gist.github.com/778542
Take a look at https://hashids.org/ it is open source and in many languages.
Their page outlines some of the pitfalls of other approaches.
If you don't want re-invent the wheel ... http://lilurl.sourceforge.net/
// simple approach
$original_id = 56789;
$shortened_id = base_convert($original_id, 10, 36);
$un_shortened_id = base_convert($shortened_id, 36, 10);
alphabet = map(chr, range(97,123)+range(65,91)) + map(str,range(0,10))
def lookup(k, a=alphabet):
if type(k) == int:
return a[k]
elif type(k) == str:
return a.index(k)
def encode(i, a=alphabet):
'''Takes an integer and returns it in the given base with mappings for upper/lower case letters and numbers 0-9.'''
try:
i = int(i)
except Exception:
raise TypeError("Input must be an integer.")
def incode(i=i, p=1, a=a):
# Here to protect p.
if i <= 61:
return lookup(i)
else:
pval = pow(62,p)
nval = i/pval
remainder = i % pval
if nval <= 61:
return lookup(nval) + incode(i % pval)
else:
return incode(i, p+1)
return incode()
def decode(s, a=alphabet):
'''Takes a base 62 string in our alphabet and returns it in base10.'''
try:
s = str(s)
except Exception:
raise TypeError("Input must be a string.")
return sum([lookup(i) * pow(62,p) for p,i in enumerate(list(reversed(s)))])a
Here's my version for whomever needs it.
Why not just translate your id to a string? You just need a function that maps a digit between, say, 0 and 61 to a single letter (upper/lower case) or digit. Then apply this to create, say, 4-letter codes, and you've got 14.7 million URLs covered.
Here is a decent URL encoding function for PHP...
// From http://snipplr.com/view/22246/base62-encode--decode/
private function base_encode($val, $base=62, $chars='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ') {
$str = '';
do {
$i = fmod($val, $base);
$str = $chars[$i] . $str;
$val = ($val - $i) / $base;
} while($val > 0);
return $str;
}
Don't know if anyone will find this useful - it is more of a 'hack n slash' method, yet is simple and works nicely if you want only specific chars.
$dictionary = "abcdfghjklmnpqrstvwxyz23456789";
$dictionary = str_split($dictionary);
// Encode
$str_id = '';
$base = count($dictionary);
while($id > 0) {
$rem = $id % $base;
$id = ($id - $rem) / $base;
$str_id .= $dictionary[$rem];
}
// Decode
$id_ar = str_split($str_id);
$id = 0;
for($i = count($id_ar); $i > 0; $i--) {
$id += array_search($id_ar[$i-1], $dictionary) * pow($base, $i - 1);
}
Did you omit O, 0, and i on purpose?
I just created a PHP class based on Ryan's solution.
<?php
$shorty = new App_Shorty();
echo 'ID: ' . 1000;
echo '<br/> Short link: ' . $shorty->encode(1000);
echo '<br/> Decoded Short Link: ' . $shorty->decode($shorty->encode(1000));
/**
* A nice shorting class based on Ryan Charmley's suggestion see the link on Stack Overflow below.
* #author Svetoslav Marinov (Slavi) | http://WebWeb.ca
* #see http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener/10386945#10386945
*/
class App_Shorty {
/**
* Explicitly omitted: i, o, 1, 0 because they are confusing. Also use only lowercase ... as
* dictating this over the phone might be tough.
* #var string
*/
private $dictionary = "abcdfghjklmnpqrstvwxyz23456789";
private $dictionary_array = array();
public function __construct() {
$this->dictionary_array = str_split($this->dictionary);
}
/**
* Gets ID and converts it into a string.
* #param int $id
*/
public function encode($id) {
$str_id = '';
$base = count($this->dictionary_array);
while ($id > 0) {
$rem = $id % $base;
$id = ($id - $rem) / $base;
$str_id .= $this->dictionary_array[$rem];
}
return $str_id;
}
/**
* Converts /abc into an integer ID
* #param string
* #return int $id
*/
public function decode($str_id) {
$id = 0;
$id_ar = str_split($str_id);
$base = count($this->dictionary_array);
for ($i = count($id_ar); $i > 0; $i--) {
$id += array_search($id_ar[$i - 1], $this->dictionary_array) * pow($base, $i - 1);
}
return $id;
}
}
?>
public class TinyUrl {
private final String characterMap = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
private final int charBase = characterMap.length();
public String covertToCharacter(int num){
StringBuilder sb = new StringBuilder();
while (num > 0){
sb.append(characterMap.charAt(num % charBase));
num /= charBase;
}
return sb.reverse().toString();
}
public int covertToInteger(String str){
int num = 0;
for(int i = 0 ; i< str.length(); i++)
num += characterMap.indexOf(str.charAt(i)) * Math.pow(charBase , (str.length() - (i + 1)));
return num;
}
}
class TinyUrlTest{
public static void main(String[] args) {
TinyUrl tinyUrl = new TinyUrl();
int num = 122312215;
String url = tinyUrl.covertToCharacter(num);
System.out.println("Tiny url: " + url);
System.out.println("Id: " + tinyUrl.covertToInteger(url));
}
}
This is what I use:
# Generate a [0-9a-zA-Z] string
ALPHABET = map(str,range(0, 10)) + map(chr, range(97, 123) + range(65, 91))
def encode_id(id_number, alphabet=ALPHABET):
"""Convert an integer to a string."""
if id_number == 0:
return alphabet[0]
alphabet_len = len(alphabet) # Cache
result = ''
while id_number > 0:
id_number, mod = divmod(id_number, alphabet_len)
result = alphabet[mod] + result
return result
def decode_id(id_string, alphabet=ALPHABET):
"""Convert a string to an integer."""
alphabet_len = len(alphabet) # Cache
return sum([alphabet.index(char) * pow(alphabet_len, power) for power, char in enumerate(reversed(id_string))])
It's very fast and can take long integers.
For a similar project, to get a new key, I make a wrapper function around a random string generator that calls the generator until I get a string that hasn't already been used in my hashtable. This method will slow down once your name space starts to get full, but as you have said, even with only 6 characters, you have plenty of namespace to work with.
I have a variant of the problem, in that I store web pages from many different authors and need to prevent discovery of pages by guesswork. So my short URLs add a couple of extra digits to the Base-62 string for the page number. These extra digits are generated from information in the page record itself and they ensure that only 1 in 3844 URLs are valid (assuming 2-digit Base-62). You can see an outline description at http://mgscan.com/MBWL.
Very good answer, I have created a Golang implementation of the bjf:
package bjf
import (
"math"
"strings"
"strconv"
)
const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
func Encode(num string) string {
n, _ := strconv.ParseUint(num, 10, 64)
t := make([]byte, 0)
/* Special case */
if n == 0 {
return string(alphabet[0])
}
/* Map */
for n > 0 {
r := n % uint64(len(alphabet))
t = append(t, alphabet[r])
n = n / uint64(len(alphabet))
}
/* Reverse */
for i, j := 0, len(t) - 1; i < j; i, j = i + 1, j - 1 {
t[i], t[j] = t[j], t[i]
}
return string(t)
}
func Decode(token string) int {
r := int(0)
p := float64(len(token)) - 1
for i := 0; i < len(token); i++ {
r += strings.Index(alphabet, string(token[i])) * int(math.Pow(float64(len(alphabet)), p))
p--
}
return r
}
Hosted at github: https://github.com/xor-gate/go-bjf
Implementation in Scala:
class Encoder(alphabet: String) extends (Long => String) {
val Base = alphabet.size
override def apply(number: Long) = {
def encode(current: Long): List[Int] = {
if (current == 0) Nil
else (current % Base).toInt :: encode(current / Base)
}
encode(number).reverse
.map(current => alphabet.charAt(current)).mkString
}
}
class Decoder(alphabet: String) extends (String => Long) {
val Base = alphabet.size
override def apply(string: String) = {
def decode(current: Long, encodedPart: String): Long = {
if (encodedPart.size == 0) current
else decode(current * Base + alphabet.indexOf(encodedPart.head),encodedPart.tail)
}
decode(0,string)
}
}
Test example with Scala test:
import org.scalatest.{FlatSpec, Matchers}
class DecoderAndEncoderTest extends FlatSpec with Matchers {
val Alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
"A number with base 10" should "be correctly encoded into base 62 string" in {
val encoder = new Encoder(Alphabet)
encoder(127) should be ("cd")
encoder(543513414) should be ("KWGPy")
}
"A base 62 string" should "be correctly decoded into a number with base 10" in {
val decoder = new Decoder(Alphabet)
decoder("cd") should be (127)
decoder("KWGPy") should be (543513414)
}
}
Function based in Xeoncross Class
function shortly($input){
$dictionary = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','0','1','2','3','4','5','6','7','8','9'];
if($input===0)
return $dictionary[0];
$base = count($dictionary);
if(is_numeric($input)){
$result = [];
while($input > 0){
$result[] = $dictionary[($input % $base)];
$input = floor($input / $base);
}
return join("", array_reverse($result));
}
$i = 0;
$input = str_split($input);
foreach($input as $char){
$pos = array_search($char, $dictionary);
$i = $i * $base + $pos;
}
return $i;
}
Here is a Node.js implementation that is likely to bit.ly. generate a highly random seven-character string.
It uses Node.js crypto to generate a highly random 25 charset rather than randomly selecting seven characters.
var crypto = require("crypto");
exports.shortURL = new function () {
this.getShortURL = function () {
var sURL = '',
_rand = crypto.randomBytes(25).toString('hex'),
_base = _rand.length;
for (var i = 0; i < 7; i++)
sURL += _rand.charAt(Math.floor(Math.random() * _rand.length));
return sURL;
};
}
My Python 3 version
base_list = list("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
base = len(base_list)
def encode(num: int):
result = []
if num == 0:
result.append(base_list[0])
while num > 0:
result.append(base_list[num % base])
num //= base
print("".join(reversed(result)))
def decode(code: str):
num = 0
code_list = list(code)
for index, code in enumerate(reversed(code_list)):
num += base_list.index(code) * base ** index
print(num)
if __name__ == '__main__':
encode(341413134141)
decode("60FoItT")
For a quality Node.js / JavaScript solution, see the id-shortener module, which is thoroughly tested and has been used in production for months.
It provides an efficient id / URL shortener backed by pluggable storage defaulting to Redis, and you can even customize your short id character set and whether or not shortening is idempotent. This is an important distinction that not all URL shorteners take into account.
In relation to other answers here, this module implements the Marcel Jackwerth's excellent accepted answer above.
The core of the solution is provided by the following Redis Lua snippet:
local sequence = redis.call('incr', KEYS[1])
local chars = '0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz'
local remaining = sequence
local slug = ''
while (remaining > 0) do
local d = (remaining % 60)
local character = string.sub(chars, d + 1, d + 1)
slug = character .. slug
remaining = (remaining - d) / 60
end
redis.call('hset', KEYS[2], slug, ARGV[1])
return slug
Why not just generate a random string and append it to the base URL? This is a very simplified version of doing this in C#.
static string chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
static string baseUrl = "https://google.com/";
private static string RandomString(int length)
{
char[] s = new char[length];
Random rnd = new Random();
for (int x = 0; x < length; x++)
{
s[x] = chars[rnd.Next(chars.Length)];
}
Thread.Sleep(10);
return new String(s);
}
Then just add the append the random string to the baseURL:
string tinyURL = baseUrl + RandomString(5);
Remember this is a very simplified version of doing this and it's possible the RandomString method could create duplicate strings. In production you would want to take in account for duplicate strings to ensure you will always have a unique URL. I have some code that takes account for duplicate strings by querying a database table I could share if anyone is interested.
This is my initial thoughts, and more thinking can be done, or some simulation can be made to see if it works well or any improvement is needed:
My answer is to remember the long URL in the database, and use the ID 0 to 9999999999999999 (or however large the number is needed).
But the ID 0 to 9999999999999999 can be an issue, because
it can be shorter if we use hexadecimal, or even base62 or base64. (base64 just like YouTube using A-Z a-z 0-9 _ and -)
if it increases from 0 to 9999999999999999 uniformly, then hackers can visit them in that order and know what URLs people are sending each other, so it can be a privacy issue
We can do this:
have one server allocate 0 to 999 to one server, Server A, so now Server A has 1000 of such IDs. So if there are 20 or 200 servers constantly wanting new IDs, it doesn't have to keep asking for each new ID, but rather asking once for 1000 IDs
for the ID 1, for example, reverse the bits. So 000...00000001 becomes 10000...000, so that when converted to base64, it will be non-uniformly increasing IDs each time.
use XOR to flip the bits for the final IDs. For example, XOR with 0xD5AA96...2373 (like a secret key), and the some bits will be flipped. (whenever the secret key has the 1 bit on, it will flip the bit of the ID). This will make the IDs even harder to guess and appear more random
Following this scheme, the single server that allocates the IDs can form the IDs, and so can the 20 or 200 servers requesting the allocation of IDs. The allocating server has to use a lock / semaphore to prevent two requesting servers from getting the same batch (or if it is accepting one connection at a time, this already solves the problem). So we don't want the line (queue) to be too long for waiting to get an allocation. So that's why allocating 1000 or 10000 at a time can solve the issue.

Resources