Using backslashes in the windows cmd [duplicate] - windows
Is the following behaviour a feature or a bug in C# / .NET?
Test application:
using System;
using System.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Arguments:");
foreach (string arg in args)
{
Console.WriteLine(arg);
}
Console.WriteLine();
Console.WriteLine("Command Line:");
var clArgs = Environment.CommandLine.Split(' ');
foreach (string arg in clArgs.Skip(clArgs.Length - args.Length))
{
Console.WriteLine(arg);
}
Console.ReadKey();
}
}
}
Run it with command line arguments:
a "b" "\\x\\" "\x\"
In the result I receive:
Arguments:
a
b
\\x\
\x"
Command Line:
a
"b"
"\\x\\"
"\x\"
Backslashes are missing and a quote has not been removed in the args passed to Main(). What is the correct workaround, other than manually parsing Environment.CommandLine?
According to this article by Jon Galloway, backslashes in command line arguments can behave in unexpected ways.
Most notably it mentions that "Most applications (including .NET applications) use CommandLineToArgvW to decode their command lines. It uses crazy escaping rules which explain the behaviour you're seeing."
The rule is that backslashes are taken literally unless they immediately precede a double quote. A run of 2n backslashes followed by a quote produces n backslashes and the quote acts as a string delimiter; a run of 2n+1 backslashes followed by a quote produces n backslashes and a literal quote. So the leading backslashes in your arguments need no escaping, but the backslashes directly before the closing quote must be doubled.
Based on these rules, I believe that to get the arguments you want you would have to pass them as:
a "b" "\\x\\\\" "\x\\"
"Whacky" indeed.
The full story of the crazy escaping rules was told in 2011 in a Microsoft blog entry: "Everyone quotes command line arguments the wrong way"
Raymond Chen also had something to say on the matter (back in 2010): "What's up with the strange treatment of quotation marks and backslashes by CommandLineToArgvW"
The situation has not changed: the escaping rules described in "Everyone quotes command line arguments the wrong way" are still correct as of 2020 and Windows 10.
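To make the rules concrete, here is a minimal sketch that models them in both directions. It is an illustration under stated assumptions, not the real CommandLineToArgvW: the names ArgvSketch, Split and Quote are made up for this example, Split expects only the argument part of the command line (no program name), and it gives no special treatment to doubled quotes inside a quoted region. Quote follows the approach described in the blog entry above and does not deal with cmd.exe metacharacters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative only: a simplified model of the rules, not the real CommandLineToArgvW.
public static class ArgvSketch
{
    // Splits the argument part of a command line (no program name).
    // Simplification: "" inside a quoted region gets no special treatment.
    public static List<string> Split(string commandLine)
    {
        var args = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        bool started = false; // true once the current argument has any content
        int i = 0;
        while (i < commandLine.Length)
        {
            char c = commandLine[i];
            if (c == '\\')
            {
                int run = 0;
                while (i < commandLine.Length && commandLine[i] == '\\') { run++; i++; }
                bool beforeQuote = i < commandLine.Length && commandLine[i] == '"';
                // Backslashes are literal unless the run ends at a quote;
                // in that case every pair collapses to one backslash.
                current.Append('\\', beforeQuote ? run / 2 : run);
                if (beforeQuote)
                {
                    if (run % 2 == 1) current.Append('"'); // odd run: literal quote
                    else inQuotes = !inQuotes;             // even run: delimiter
                    i++; // consume the quote
                }
                started = true;
            }
            else if (c == '"')
            {
                inQuotes = !inQuotes;
                started = true;
                i++;
            }
            else if ((c == ' ' || c == '\t') && !inQuotes)
            {
                if (started) { args.Add(current.ToString()); current.Clear(); started = false; }
                i++;
            }
            else
            {
                current.Append(c);
                started = true;
                i++;
            }
        }
        if (started) args.Add(current.ToString());
        return args;
    }

    // Quotes one argument so that Split recovers it unchanged.
    public static string Quote(string argument)
    {
        if (argument.Length > 0 && argument.IndexOfAny(new[] { ' ', '\t', '\n', '\v', '"' }) < 0)
            return argument; // nothing that needs quoting

        var sb = new StringBuilder("\"");
        int backslashes = 0;
        foreach (char c in argument)
        {
            if (c == '\\') { backslashes++; continue; }
            // Pending backslashes are doubled (plus one more) only before a quote.
            sb.Append('\\', c == '"' ? backslashes * 2 + 1 : backslashes);
            sb.Append(c);
            backslashes = 0;
        }
        sb.Append('\\', backslashes * 2); // trailing run sits before the closing quote
        sb.Append('"');
        return sb.ToString();
    }
}
```

Under this model, splitting a "b" "\\x\\" "\x\" gives a, b, \\x\ and \x", which matches the output in the question, while a "b" "\\x\\\\" "\x\\" gives a, b, \\x\\ and \x\.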
I came across this same issue the other day and had a tough time getting through it. In my googling, I came across this article regarding VB.NET (the language of my application) that solved the problem without having to change any of my other code based on the arguments.
In that article, he refers to the original article, which was written for C#. Here's the actual code; you pass it Environment.CommandLine:
C#
using System;
using System.Text;
class CommandLineTools
{
/// <summary>
/// C-like argument parser
/// </summary>
/// <param name="commandLine">Command line string with arguments. Use Environment.CommandLine</param>
/// <returns>The args[] array (argv)</returns>
public static string[] CreateArgs(string commandLine)
{
StringBuilder argsBuilder = new StringBuilder(commandLine);
bool inQuote = false;
// Convert the spaces to a newline sign so we can split at newline later on
// Only convert spaces which are outside the boundaries of quoted text
for (int i = 0; i < argsBuilder.Length; i++)
{
if (argsBuilder[i].Equals('"'))
{
inQuote = !inQuote;
}
if (argsBuilder[i].Equals(' ') && !inQuote)
{
argsBuilder[i] = '\n';
}
}
// Split to args array
string[] args = argsBuilder.ToString().Split(new char[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
// Clean the '"' signs from the args as needed.
for (int i = 0; i < args.Length; i++)
{
args[i] = ClearQuotes(args[i]);
}
return args;
}
/// <summary>
/// Cleans quotes from the arguments.<br/>
/// All single quotes (") will be removed.<br/>
/// Every pair of quotes ("") will transform to a single quote.<br/>
/// </summary>
/// <param name="stringWithQuotes">A string with quotes.</param>
/// <returns>The same string if it has no quotes, or a cleaned string if it has quotes.</returns>
private static string ClearQuotes(string stringWithQuotes)
{
int quoteIndex;
if ((quoteIndex = stringWithQuotes.IndexOf('"')) == -1)
{
// String is without quotes..
return stringWithQuotes;
}
// Linear sb scan is faster than string assignment if quote count is 2 or more (=always)
StringBuilder sb = new StringBuilder(stringWithQuotes);
for (int i = quoteIndex; i < sb.Length; i++)
{
if (sb[i].Equals('"'))
{
// If we are not at the last index and the next one is '"', we need to jump one to preserve one
if (i != sb.Length - 1 && sb[i + 1].Equals('"'))
{
i++;
}
// We remove and then set index one backwards.
// This is because the remove itself is going to shift everything left by 1.
sb.Remove(i--, 1);
}
}
return sb.ToString();
}
}
VB.NET:
Imports System.Text
' Original version by Jonathan Levison (C#)'
' http://sleepingbits.com/2010/01/command-line-arguments-with-double-quotes-in-net/
' converted using http://www.developerfusion.com/tools/convert/csharp-to-vb/
' and then some manual effort to fix language discrepancies
Friend Class CommandLineHelper
''' <summary>
''' C-like argument parser
''' </summary>
''' <param name="commandLine">Command line string with arguments. Use Environment.CommandLine</param>
''' <returns>The args[] array (argv)</returns>
Public Shared Function CreateArgs(commandLine As String) As String()
Dim argsBuilder As New StringBuilder(commandLine)
Dim inQuote As Boolean = False
' Convert the spaces to a newline sign so we can split at newline later on
' Only convert spaces which are outside the boundaries of quoted text
For i As Integer = 0 To argsBuilder.Length - 1
If argsBuilder(i).Equals(""""c) Then
inQuote = Not inQuote
End If
If argsBuilder(i).Equals(" "c) AndAlso Not inQuote Then
argsBuilder(i) = ControlChars.Lf
End If
Next
' Split to args array
Dim args As String() = argsBuilder.ToString().Split(New Char() {ControlChars.Lf}, StringSplitOptions.RemoveEmptyEntries)
' Clean the '"' signs from the args as needed.
For i As Integer = 0 To args.Length - 1
args(i) = ClearQuotes(args(i))
Next
Return args
End Function
''' <summary>
''' Cleans quotes from the arguments.<br/>
''' All single quotes (") will be removed.<br/>
''' Every pair of quotes ("") will transform to a single quote.<br/>
''' </summary>
''' <param name="stringWithQuotes">A string with quotes.</param>
''' <returns>The same string if it has no quotes, or a cleaned string if it has quotes.</returns>
Private Shared Function ClearQuotes(stringWithQuotes As String) As String
Dim quoteIndex As Integer = stringWithQuotes.IndexOf(""""c)
If quoteIndex = -1 Then Return stringWithQuotes
' Linear sb scan is faster than string assignment if quote count is 2 or more (=always)
Dim sb As New StringBuilder(stringWithQuotes)
Dim i As Integer = quoteIndex
Do While i < sb.Length
If sb(i).Equals(""""c) Then
' If we are not at the last index and the next one is '"', we need to jump one to preserve one
If i <> sb.Length - 1 AndAlso sb(i + 1).Equals(""""c) Then
i += 1
End If
' We remove and then set the index one backwards.
' This is because the remove itself is going to shift everything left by 1.
sb.Remove(i, 1)
i -= 1
End If
i += 1
Loop
Return sb.ToString()
End Function
End Class
I sidestepped the problem in a different way.
Instead of using the arguments that have already been parsed, I take the raw argument string as it is and run it through my own parser:
static void Main(string[] args)
{
var param = ParseString(Environment.CommandLine);
...
}
// The following pattern implements this notation:
// -key1 = some value -key2 = "some value even with '-' character " ...
private const string ParameterQuery = "\\-(?<key>\\w+)\\s*=\\s*(\"(?<value>[^\"]*)\"|(?<value>[^\\-]*))\\s*";
private static Dictionary<string, string> ParseString(string value)
{
var regex = new Regex(ParameterQuery);
return regex.Matches(value).Cast<Match>().ToDictionary(m => m.Groups["key"].Value, m => m.Groups["value"].Value);
}
This approach lets you type quotes without an escape prefix.
After much experimentation this worked for me. I'm trying to create a command to send to the Windows command line. A folder name comes after the -graphical option in the command, and since it may have spaces in it, it has to be wrapped in double quotes. When I used backslashes to create the quotes they came out as literals in the command. So I did this:
string q = @"" + (char) 34;
string strCmdText = string.Format(@"/C cleartool update -graphical {1}{0}{1}", this.txtViewFolder.Text, q);
System.Diagnostics.Process.Start("CMD.exe", strCmdText);
q is a string holding just a double quote character: an empty verbatim string literal (the @ prefix) with the character whose code is 34 appended.
The command template is also a verbatim string literal, and the string.Format method is used to assemble everything into strCmdText.
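As an alternative sketch: inside a verbatim string literal a double quote can be written by doubling it, so the separate q variable is not strictly needed. The class and method names here (CleartoolCommand, BuildUpdateArguments) are hypothetical; the cleartool command line is the one from this answer.

```csharp
using System;

public static class CleartoolCommand
{
    // Builds the argument string handed to cmd.exe.
    // Inside a verbatim (@) literal, "" stands for one literal double quote.
    public static string BuildUpdateArguments(string viewFolder)
    {
        return string.Format(@"/C cleartool update -graphical ""{0}""", viewFolder);
    }
}
```

The result can be passed to System.Diagnostics.Process.Start("CMD.exe", ...) exactly as before. A folder name that ends in a backslash would still be subject to the backslash-before-quote rules discussed earlier on this page.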
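This works for me, and it works correctly with the example in the question. It doubles the backslashes that directly precede a quote and then calls CommandLineToArgvW itself through P/Invoke. The code needs using directives for System, System.ComponentModel, System.Linq, System.Runtime.InteropServices and System.Text.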
/// <summary>
/// https://www.pinvoke.net/default.aspx/shell32/CommandLineToArgvW.html
/// </summary>
/// <param name="unsplitArgumentLine"></param>
/// <returns></returns>
static string[] SplitArgs(string unsplitArgumentLine)
{
int numberOfArgs;
IntPtr ptrToSplitArgs;
string[] splitArgs;
ptrToSplitArgs = CommandLineToArgvW(unsplitArgumentLine, out numberOfArgs);
// CommandLineToArgvW returns NULL upon failure.
if (ptrToSplitArgs == IntPtr.Zero)
throw new ArgumentException("Unable to split argument.", new Win32Exception());
// Make sure the memory ptrToSplitArgs points to is freed, even upon failure.
try
{
splitArgs = new string[numberOfArgs];
// ptrToSplitArgs is an array of pointers to null terminated Unicode strings.
// Copy each of these strings into our split argument array.
for (int i = 0; i < numberOfArgs; i++)
splitArgs[i] = Marshal.PtrToStringUni(
Marshal.ReadIntPtr(ptrToSplitArgs, i * IntPtr.Size));
return splitArgs;
}
finally
{
// Free memory obtained by CommandLineToArgvW.
LocalFree(ptrToSplitArgs);
}
}
[DllImport("shell32.dll", SetLastError = true)]
static extern IntPtr CommandLineToArgvW(
[MarshalAs(UnmanagedType.LPWStr)] string lpCmdLine,
out int pNumArgs);
[DllImport("kernel32.dll")]
static extern IntPtr LocalFree(IntPtr hMem);
static string Reverse(string s)
{
char[] charArray = s.ToCharArray();
Array.Reverse(charArray);
return new string(charArray);
}
static string GetEscapedCommandLine()
{
StringBuilder sb = new StringBuilder();
bool gotQuote = false;
foreach (var c in Environment.CommandLine.Reverse())
{
if (c == '"')
gotQuote = true;
else if (gotQuote && c == '\\')
{
// double it
sb.Append('\\');
}
else
gotQuote = false;
sb.Append(c);
}
return Reverse(sb.ToString());
}
static void Main(string[] args)
{
// Crazy hack
args = SplitArgs(GetEscapedCommandLine()).Skip(1).ToArray();
}
Related
Java (move point in string)
I have a string, "abc", and a point ".". How can I move this point through that string ("abc")?
Example: input data = "abc"; output data = "abc", "a.bc", "ab.c", "a.b.c". Thanks.
public class MovePoint {
    public static void main(String[] args) {
        String str = "abcd";
        String str1 = ".";
        String[] ara = new String[str.length()];
        for (int i = 0; i < str.length(); i++) {
            ara[i] = str.substring(i, 1) + str1 + str.substring(1, 2);
            System.out.print(Arrays.toString(ara));
        }
    }
}
Here is one way to do it. This uses a StringBuilder as well as a plain char array to avoid a second loop over the array to build the last String, at the cost of more memory.
First I print the first desired output, which is just the unmodified input String. Then I create a StringBuilder which can hold all chars from the input plus one more for the chosen separator, to avoid unnecessary array resizing. Then I initialize the StringBuilder so that it is in the form of the second desired output [char, sep, char, ...]. I am using StringBuilder here because it is more convenient, as it has the append() function that I need. Last but not least, I also initialize a char array which will hold the values for the last String, to avoid having to iterate over the array twice to generate it.
Now I loop over the StringBuilder, starting from index one (as it is already initialized to the first result with a separator) up to the last character. In this loop I do three things:
1. Print out the current value of the StringBuilder.
2. Swap the separator with the next character in the StringBuilder.
3. Put the character and separator at the correct position in the char array, as required for the last string.
After the loop, the last desired output has been computed and I just have to print it to the console.
The runtime for this in big-O notation would be O(n).
public static void main(String[] args) {
    String str = "abcd";
    char sep = '.';
    movePoint(str, sep);
}
public static void movePoint(String str, char sep) {
    // print first desired output
    System.out.println(str);
    // String builder that can hold str.length + 1 characters, so no unnecessary resizing happens
    var sb = new StringBuilder(str.length() + 1);
    // fill with first char
    sb.append(str.charAt(0));
    // add separator
    sb.append(sep);
    // add rest of the string
    sb.append(str.substring(1));
    // Array that holds the last string
    var lastStr = new char[str.length() + str.length() - 1];
    for (int i = 1; i < sb.capacity() - 1; i++) {
        System.out.println(sb);
        // build current string
        // swap separator with next character
        var temp = sb.charAt(i);
        sb.setCharAt(i, sb.charAt(i + 1));
        sb.setCharAt(i + 1, temp);
        // manipulate char array so last string is built correctly
        int doubled = i << 1;
        // set character at correct position
        lastStr[doubled - 2] = sb.charAt(i - 1);
        // set separator at correct position
        lastStr[doubled - 1] = sep;
    }
    // add last character of string to this char array
    lastStr[lastStr.length - 1] = sb.charAt(sb.length() - 2);
    // print last desired output
    System.out.println(lastStr);
}
Expected output:
abcd
a.bcd
ab.cd
abc.d
a.b.c.d
SSIS For Loop Stopped working
I have a For Loop container in my SSIS package which contains a script task and a SQL task.
I have 3 variables:
source (string) = the folder location
file (string) = a wildcard, *.csv
exist (int) = defaults to 0
I have the InitExpression value set to @Exists=1 and the EvalExpression value set to @Exists=1.
In the script I have set it to look at the source variable and, if the file named by the file variable exists, to set the exist variable to 1.
The problem is that it just keeps looping; it should only loop if there is no file there. I can't see what I've done wrong. It was working before I changed the variable to be a wildcard, *.csv.
I have tested it using another variable which contains a filename rather than a wildcard, and it works correctly. The issue only arises when using a wildcard for the filename followed by the extension. Why is this? Can I not pass through a wildcard variable?
My script task is:
public void Main()
{
    // TODO: Add your code here
    string Filepath = Dts.Variables["User::Source"].Value.ToString()
        + Dts.Variables["User::file"].Value.ToString();
    if (File.Exists(Filepath))
    {
        Dts.Variables["User::Exists"].Value = 1;
    }
    /// MessageBox.Show (Filepath);
    /// MessageBox.Show(Dts.Variables["Exists"].Value.ToString());
    Dts.TaskResult = (int)ScriptResults.Success;
}
#region ScriptResults declaration
/// <summary>
/// This enum provides a convenient shorthand within the scope of this class for setting the
/// result of the script.
///
/// This code was generated automatically.
/// </summary>
enum ScriptResults
{
    Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
    Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
};
#endregion
}
}
Based on the comments above, I made 2 different solutions. The solution for you right now would be no. 2.
1. This one can search for a specific file among multiple files in your path. It needs some tweaking, but can be used if you want to check whether a specific file exists using a wildcard.
2. This one evaluates to true if any file matching the wildcard is found.
C# Code 1
using System.IO;
string Filepath = Dts.Variables["User::Source"].Value.ToString();
string WildCard = Dts.Variables["User::file"].Value.ToString(); // In text form @"*.txt";
string fullpath = Filepath + WildCard;
// With for loop
string txtFile = null;
// Gets all files with wildcard
string[] allfiles = Directory.GetFiles(Filepath, WildCard);
// Loop through all files and set the filename in txtFile. Do whatever you want here
foreach (string fileName in allfiles)
{
    // Check if a file contains something, it could be a prefixed name you only want
    if (fileName.Contains("txt"))
    {
        txtFile = fileName;
        if (File.Exists(txtFile))
        {
            Dts.Variables["User::Exists"].Value = 1;
        }
    }
}
C# Code 2
using System.IO;
using System.Linq;
string Filepath = Dts.Variables["User::Source"].Value.ToString();
string WildCard = Dts.Variables["User::file"].Value.ToString(); // In text form "*.txt";
string fullpath = Filepath + WildCard;
// With bool
bool exists = Directory.EnumerateFiles(Filepath, WildCard).Any();
if (exists == true)
{
    Dts.Variables["User::Exists"].Value = 1;
}
MessageBox.Show(Filepath);
MessageBox.Show(Dts.Variables["Exists"].Value.ToString());
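As for why the original script never sets Exists: File.Exists checks a single literal path and does not expand wildcards, so a path ending in *.csv is never found. Below is a minimal sketch of the wildcard check on its own, outside SSIS. The names WildcardCheck and AnyFileMatches are hypothetical, and the folder and pattern parameters stand in for the Source and file variables.

```csharp
using System.IO;
using System.Linq;

public static class WildcardCheck
{
    // Returns true if at least one file in the folder matches the search
    // pattern (for example "*.csv"). A missing folder counts as "no match".
    public static bool AnyFileMatches(string folder, string pattern)
    {
        if (!Directory.Exists(folder))
        {
            return false;
        }
        // EnumerateFiles is lazy, so Any() can stop at the first match.
        return Directory.EnumerateFiles(folder, pattern).Any();
    }
}
```

In the script task the result would be used to set User::Exists to 1, as in the second solution.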
Processing: create an array of the characters within the string
I am new to Processing and am trying to figure out a way to create an array of all the characters within a string. Currently I have:
String[] words = {"hello", "devak", "road", "duck", "face"};
String theWord = words[int(random(0,words.length))];
I've been googling and haven't found a good solution yet. Thanks in advance.
In addition to the comment you posted (which perhaps should have been an answer), there are a ton of ways to split a String. The most obvious solution might be the String.split() function. If you give that function an empty String "" as an argument, it will split at every character:
void setup() {
    String myString = "testing testing 123";
    String[] chars = myString.split("");
    for (String c : chars) {
        println(c);
    }
}
You could also just use the String.charAt() function:
void setup() {
    String myString = "testing testing 123";
    for (int i = 0; i < myString.length(); i++) {
        char c = myString.charAt(i);
        println(c);
    }
}
What's the simplest algorithm to escape a single character?
I'm trying to write two functions escape(text, delimiter) and unescape(text, delimiter) with the following properties:
1. The result of escape does not contain delimiter.
2. unescape is the reverse of escape, i.e. unescape(escape(text, delimiter), delimiter) == text for all values of text and delimiter.
It is OK to restrict the allowed values of delimiter.
Background: I want to create a delimiter-separated string of values. To be able to extract the same list out of the string again, I must ensure that the individual, separated strings do not contain the separator.
What I've tried: I came up with a simple solution (pseudo-code):
escape(text, delimiter): return text.Replace("\", "\\").Replace(delimiter, "\d")
unescape(text, delimiter): return text.Replace("\d", delimiter).Replace("\\", "\")
but discovered that property 2 failed on the test string "\d<delimiter>". Currently, I have the following working solution:
escape(text, delimiter): return text.Replace("\", "\b").Replace(delimiter, "\d")
unescape(text, delimiter): return text.Replace("\d", delimiter).Replace("\b", "\")
which seems to work, as long as delimiter is not \, b or d (which is fine, I don't want to use those as delimiters anyway). However, since I have not formally proven its correctness, I'm afraid that I have missed some case where one of the properties is violated. Since this is such a common problem, I assume that there is already a "well-known proven-correct" algorithm for this, hence my question (see title).
Your first algorithm is correct. The error is in the implementation of unescape(): you need to replace both \d by delimiter and \\ by \ in the same pass. You can't use several calls to Replace() like this. For the test string "\d<delimiter>", escape produces "\\d\d"; replacing "\d" first also matches the second backslash of "\\" together with the d that follows it, so the original text is not restored.
Here's some sample C# code for safe quoting of delimiter-separated strings:
// "~" -> "~~"   ";" -> "~s"
static string QuoteSeparator(string str, char separator, char quoteChar, char otherChar)
{
    var sb = new StringBuilder(str.Length);
    foreach (char c in str)
    {
        if (c == quoteChar)
        {
            sb.Append(quoteChar);
            sb.Append(quoteChar);
        }
        else if (c == separator)
        {
            sb.Append(quoteChar);
            sb.Append(otherChar);
        }
        else
        {
            sb.Append(c);
        }
    }
    return sb.ToString(); // no separator in the result -> Join/Split is safe
}
// "~~" -> "~"   "~s" -> ";"
static string UnquoteSeparator(string str, char separator, char quoteChar, char otherChar)
{
    var sb = new StringBuilder(str.Length);
    bool isQuoted = false;
    foreach (char c in str)
    {
        if (isQuoted)
        {
            if (c == otherChar)
                sb.Append(separator);
            else
                sb.Append(c);
            isQuoted = false;
        }
        else
        {
            if (c == quoteChar)
                isQuoted = true;
            else
                sb.Append(c);
        }
    }
    if (isQuoted)
        throw new ArgumentException("input string is not correctly quoted");
    return sb.ToString(); // ";" are restored
}
/// <summary>
/// Encodes the given strings as a single string.
/// </summary>
/// <param name="input">The strings.</param>
/// <param name="separator">The separator.</param>
/// <param name="quoteChar">The quote char.</param>
/// <param name="otherChar">The other char.</param>
/// <returns></returns>
public static string QuoteAndJoin(this IEnumerable<string> input,
    char separator = ';', char quoteChar = '~', char otherChar = 's')
{
    CommonHelper.CheckNullReference(input, "input");
    if (separator == quoteChar || quoteChar == otherChar || separator == otherChar)
        throw new ArgumentException("cannot quote: ambiguous format");
    return string.Join(new string(separator, 1),
        (from str in input select QuoteSeparator(str, separator, quoteChar, otherChar)).ToArray());
}
/// <summary>
/// Decodes the strings encoded in a single string.
/// </summary>
/// <param name="encoded">The encoded.</param>
/// <param name="separator">The separator.</param>
/// <param name="quoteChar">The quote char.</param>
/// <param name="otherChar">The other char.</param>
/// <returns></returns>
public static IEnumerable<string> SplitAndUnquote(this string encoded,
    char separator = ';', char quoteChar = '~', char otherChar = 's')
{
    CommonHelper.CheckNullReference(encoded, "encoded");
    if (separator == quoteChar || quoteChar == otherChar || separator == otherChar)
        throw new ArgumentException("cannot unquote: ambiguous format");
    return from s in encoded.Split(separator)
           select UnquoteSeparator(s, separator, quoteChar, otherChar);
}
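The same single-pass idea can be applied directly to the backslash scheme from the question (\\ for a backslash, \d for the delimiter). This is a minimal sketch; the class name DelimiterEscaping is made up for the example, and it assumes the delimiter is a single character other than \ and d.

```csharp
using System;
using System.Text;

public static class DelimiterEscaping
{
    const char EscapeChar = '\\';
    const char DelimiterCode = 'd';

    // "\" -> "\\", delimiter -> "\d"; the result never contains the delimiter.
    public static string Escape(string text, char delimiter)
    {
        CheckDelimiter(delimiter);
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c == EscapeChar) sb.Append(EscapeChar).Append(EscapeChar);
            else if (c == delimiter) sb.Append(EscapeChar).Append(DelimiterCode);
            else sb.Append(c);
        }
        return sb.ToString();
    }

    // Reads left to right and decodes each escape sequence exactly once,
    // which is what chained Replace() calls cannot guarantee.
    public static string Unescape(string text, char delimiter)
    {
        CheckDelimiter(delimiter);
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c != EscapeChar) { sb.Append(c); continue; }
            if (i + 1 == text.Length) throw new FormatException("dangling escape character");
            char code = text[++i];
            if (code == EscapeChar) sb.Append(EscapeChar);
            else if (code == DelimiterCode) sb.Append(delimiter);
            else throw new FormatException("unknown escape sequence");
        }
        return sb.ToString();
    }

    static void CheckDelimiter(char delimiter)
    {
        if (delimiter == EscapeChar || delimiter == DelimiterCode)
            throw new ArgumentException("delimiter must not be '\\' or 'd'");
    }
}
```

With ';' as the delimiter, the problematic test string \d; escapes to \\d\d and unescapes back to \d;.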
Maybe you can have an alternative replacement for the case when the delimiter is \, b or d. Use the same alternative replacement in the unescape algorithm as well.
Does anyone have a good Proper Case algorithm
Does anyone have a trusted Proper Case or PCase algorithm (similar to a UCase or Upper)? I'm looking for something that takes a value such as "GEORGE BURDELL" or "george burdell" and turns it into "George Burdell".
I have a simple one that handles the simple cases. The ideal would be to have something that can handle things such as "O'REILLY" and turn it into "O'Reilly", but I know that is tougher.
I am mainly focused on the English language if that simplifies things.
UPDATE: I'm using C# as the language, but I can convert from almost anything (assuming similar functionality exists).
I agree that the McDonald's scenario is a tough one. I meant to mention that along with my O'Reilly example, but did not in the original post.
Unless I've misunderstood your question, I don't think you need to roll your own; the TextInfo class can do it for you.
using System.Globalization;
CultureInfo.InvariantCulture.TextInfo.ToTitleCase("GeOrGE bUrdEll")
will return "George Burdell". And you can use your own culture if there are some special rules involved.
Update: Michael (in a comment to this answer) pointed out that this will not work if the input is all caps, since the method will assume that it is an acronym. The naive workaround for this is to .ToLower() the text before submitting it to ToTitleCase.
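The workaround from the update can be wrapped in a small helper. This is a minimal sketch with hypothetical names (TitleCaseSketch, ToTitleCaseLoweredFirst); it uses the invariant culture, as in the answer, and does nothing special for names such as McDonald.

```csharp
using System.Globalization;

public static class TitleCaseSketch
{
    // Lower-cases the input first so that all-caps words are not
    // treated as acronyms and left untouched by ToTitleCase.
    public static string ToTitleCaseLoweredFirst(string input)
    {
        TextInfo textInfo = CultureInfo.InvariantCulture.TextInfo;
        return textInfo.ToTitleCase(textInfo.ToLower(input));
    }
}
```

Both "GEORGE BURDELL" and "george burdell" come back as "George Burdell".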
@zwol: I'll post it as a separate reply. Here's an example based on ljs's post.
void Main()
{
    List<string> names = new List<string>() {
        "bill o'reilly",
        "johannes diderik van der waals",
        "mr. moseley-williams",
        "Joe VanWyck",
        "mcdonald's",
        "william the third",
        "hrh prince charles",
        "h.r.m. queen elizabeth the third",
        "william gates, iii",
        "pope leo xii",
        "a.k. jennings"
    };
    names.Select(name => name.ToProperCase()).Dump();
}
// http://stackoverflow.com/questions/32149/does-anyone-have-a-good-proper-case-algorithm
public static class ProperCaseHelper
{
    public static string ToProperCase(this string input)
    {
        if (IsAllUpperOrAllLower(input))
        {
            // fix the ALL UPPERCASE or all lowercase names
            return string.Join(" ", input.Split(' ').Select(word => wordToProperCase(word)));
        }
        else
        {
            // leave the CamelCase or Propercase names alone
            return input;
        }
    }
    public static bool IsAllUpperOrAllLower(this string input)
    {
        return (input.ToLower().Equals(input) || input.ToUpper().Equals(input));
    }
    private static string wordToProperCase(string word)
    {
        if (string.IsNullOrEmpty(word)) return word;
        // Standard case
        string ret = capitaliseFirstLetter(word);
        // Special cases:
        ret = properSuffix(ret, "'"); // D'Artagnon, D'Silva
        ret = properSuffix(ret, "."); // ???
        ret = properSuffix(ret, "-"); // Oscar-Meyer-Weiner
        ret = properSuffix(ret, "Mc", t => t.Length > 4); // Scots
        ret = properSuffix(ret, "Mac", t => t.Length > 5); // Scots except Macey
        // Special words:
        ret = specialWords(ret, "van"); // Dick van Dyke
        ret = specialWords(ret, "von"); // Baron von Bruin-Valt
        ret = specialWords(ret, "de");
        ret = specialWords(ret, "di");
        ret = specialWords(ret, "da"); // Leonardo da Vinci, Eduardo da Silva
        ret = specialWords(ret, "of"); // The Grand Old Duke of York
        ret = specialWords(ret, "the"); // William the Conqueror
        ret = specialWords(ret, "HRH"); // His/Her Royal Highness
        ret = specialWords(ret, "HRM"); // His/Her Royal Majesty
        ret = specialWords(ret, "H.R.H."); // His/Her Royal Highness
        ret = specialWords(ret, "H.R.M."); // His/Her Royal Majesty
        ret = dealWithRomanNumerals(ret); // William Gates, III
        return ret;
    }
    private static string properSuffix(string word, string prefix, Func<string, bool> condition = null)
    {
        if (string.IsNullOrEmpty(word)) return word;
        if (condition != null && !condition(word)) return word;
        string lowerWord = word.ToLower();
        string lowerPrefix = prefix.ToLower();
        if (!lowerWord.Contains(lowerPrefix)) return word;
        int index = lowerWord.IndexOf(lowerPrefix);
        // If the search string is at the end of the word ignore.
        if (index + prefix.Length == word.Length) return word;
        return word.Substring(0, index) + prefix +
            capitaliseFirstLetter(word.Substring(index + prefix.Length));
    }
    private static string specialWords(string word, string specialWord)
    {
        if (word.Equals(specialWord, StringComparison.InvariantCultureIgnoreCase))
        {
            return specialWord;
        }
        else
        {
            return word;
        }
    }
    private static string dealWithRomanNumerals(string word)
    {
        // Roman Numeral parser thanks to [djk](https://stackoverflow.com/users/785111/djk)
        // Note that it excludes the Chinese last name Xi
        return new Regex(@"\b(?!Xi\b)(X|XX|XXX|XL|L|LX|LXX|LXXX|XC|C)?(I|II|III|IV|V|VI|VII|VIII|IX)?\b", RegexOptions.IgnoreCase)
            .Replace(word, match => match.Value.ToUpperInvariant());
    }
    private static string capitaliseFirstLetter(string word)
    {
        return char.ToUpper(word[0]) + word.Substring(1).ToLower();
    }
}
There's also this neat Perl script for title-casing text. http://daringfireball.net/2008/08/title_case_update
#!/usr/bin/perl
# This filter changes all words to Title Caps, and attempts to be clever
# about *un*capitalizing small words like a/an/the in the input.
#
# The list of "small words" which are not capped comes from
# the New York Times Manual of Style, plus 'vs' and 'v'.
#
# 10 May 2008
# Original version by John Gruber:
# http://daringfireball.net/2008/05/title_case
#
# 28 July 2008
# Re-written and much improved by Aristotle Pagaltzis:
# http://plasmasturm.org/code/titlecase/
#
# Full change log at __END__.
#
# License: http://www.opensource.org/licenses/mit-license.php
#
use strict;
use warnings;
use utf8;
use open qw( :encoding(UTF-8) :std );
my @small_words = qw( (?<!q&)a an and as at(?!&t) but by en for if in of on or the to v[.]? via vs[.]? );
my $small_re = join '|', @small_words;
my $apos = qr/ (?: ['’] [[:lower:]]* )? /x;
while ( <> ) {
    s{\A\s+}{}, s{\s+\z}{};
    $_ = lc $_ if not /[[:lower:]]/;
    s{
        \b (_*) (?:
            ( (?<=[ ][/\\]) [[:alpha:]]+ [-_[:alpha:]/\\]+ |   # file path or
              [-_[:alpha:]]+ [@.:] [-_[:alpha:]@.:/]+ $apos )  # URL, domain, or email
            |
            ( (?i: $small_re ) $apos )                         # or small word (case-insensitive)
            |
            ( [[:alpha:]] [[:lower:]'’()\[\]{}]* $apos )       # or word w/o internal caps
            |
            ( [[:alpha:]] [[:alpha:]'’()\[\]{}]* $apos )       # or some other word
        ) (_*) \b
    }{
        $1 . (
          defined $2 ? $2         # preserve URL, domain, or email
        : defined $3 ? "\L$3"     # lowercase small word
        : defined $4 ? "\u\L$4"   # capitalize word w/o internal caps
        : $5                      # preserve other kinds of word
        ) . $6
    }xeg;
    # Exceptions for small words: capitalize at start and end of title
    s{
        (  \A [[:punct:]]*         # start of title...
        |  [:.;?!][ ]+             # or of subsentence...
        |  [ ]['"“‘(\[][ ]*     )  # or of inserted subphrase...
        ( $small_re ) \b           # ... followed by small word
    }{$1\u\L$2}xig;
    s{
        \b ( $small_re )      # small word...
        (?= [[:punct:]]* \Z   # ... at the end of the title...
        |   ['"’”)\]] [ ] )   # ... or of an inserted subphrase?
    }{\u\L$1}xig;
    # Exceptions for small words in hyphenated compound words
    ## e.g. "in-flight" -> In-Flight
    s{
        \b
        (?<! -)               # Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (in-flight)
        ( $small_re )
        (?= -[[:alpha:]]+)    # lookahead for "-someword"
    }{\u\L$1}xig;
    ## # e.g. "Stand-in" -> "Stand-In" (Stand is already capped at this point)
    s{
        \b
        (?<!…)                # Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (stand-in)
        ( [[:alpha:]]+- )     # $1 = first word and hyphen, should already be properly capped
        ( $small_re )         # ... followed by small word
        (?! - )               # Negative lookahead for another '-'
    }{$1\u$2}xig;
    print "$_";
}
__END__
But it sounds like by proper case you mean.. for people's names only.
I did a quick C# port of https://github.com/tamtamchik/namecase, which is based on Lingua::EN::NameCase.

using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CIQNameCase
{
    static Dictionary<string, string> _exceptions = new Dictionary<string, string>
    {
        {@"\bMacEdo"     ,"Macedo"},
        {@"\bMacEvicius" ,"Macevicius"},
        {@"\bMacHado"    ,"Machado"},
        {@"\bMacHar"     ,"Machar"},
        {@"\bMacHin"     ,"Machin"},
        {@"\bMacHlin"    ,"Machlin"},
        {@"\bMacIas"     ,"Macias"},
        {@"\bMacIulis"   ,"Maciulis"},
        {@"\bMacKie"     ,"Mackie"},
        {@"\bMacKle"     ,"Mackle"},
        {@"\bMacKlin"    ,"Macklin"},
        {@"\bMacKmin"    ,"Mackmin"},
        {@"\bMacQuarie"  ,"Macquarie"}
    };

    static Dictionary<string, string> _replacements = new Dictionary<string, string>
    {
        {@"\bAl(?=\s+\w)"         , @"al"},     // al Arabic or forename Al.
        {@"\b(Bin|Binti|Binte)\b" , @"bin"},    // bin, binti, binte Arabic
        {@"\bAp\b"                , @"ap"},     // ap Welsh.
        {@"\bBen(?=\s+\w)"        , @"ben"},    // ben Hebrew or forename Ben.
        {@"\bDell([ae])\b"        , @"dell$1"}, // della and delle Italian.
        {@"\bD([aeiou])\b"        , @"d$1"},    // da, de, di Italian; du French; do Brasil
        {@"\bD([ao]s)\b"          , @"d$1"},    // das, dos Brasileiros
        {@"\bDe([lrn])\b"         , @"de$1"},   // del Italian; der/den Dutch/Flemish.
        {@"\bEl\b"                , @"el"},     // el Greek or El Spanish.
        {@"\bLa\b"                , @"la"},     // la French or La Spanish.
        {@"\bL([eo])\b"           , @"l$1"},    // lo Italian; le French.
        {@"\bVan(?=\s+\w)"        , @"van"},    // van German or forename Van.
        {@"\bVon\b"               , @"von"}     // von Dutch/Flemish
    };

    static string[] _conjunctions = { "Y", "E", "I" };

    static string _romanRegex = @"\b((?:[Xx]{1,3}|[Xx][Ll]|[Ll][Xx]{0,3})?(?:[Ii]{1,3}|[Ii][VvXx]|[Vv][Ii]{0,3})?)\b";

    /// <summary>
    /// Case a name field into its appropriate case format
    /// e.g. Smith, de la Cruz, Mary-Jane, O'Brien, McTaggart
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    public static string NameCase(string nameString)
    {
        // Capitalize
        nameString = Capitalize(nameString);
        nameString = UpdateIrish(nameString);

        // Fixes for "son (daughter) of" etc
        foreach (var replacement in _replacements.Keys)
        {
            if (Regex.IsMatch(nameString, replacement))
            {
                Regex rgx = new Regex(replacement);
                nameString = rgx.Replace(nameString, _replacements[replacement]);
            }
        }

        nameString = UpdateRoman(nameString);
        nameString = FixConjunction(nameString);

        return nameString;
    }

    /// <summary>
    /// Capitalize first letters.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string Capitalize(string nameString)
    {
        nameString = nameString.ToLower();
        nameString = Regex.Replace(nameString, @"\b\w", x => x.ToString().ToUpper());
        nameString = Regex.Replace(nameString, @"'\w\b", x => x.ToString().ToLower()); // Lowercase 's
        return nameString;
    }

    /// <summary>
    /// Update for Irish names.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string UpdateIrish(string nameString)
    {
        if (Regex.IsMatch(nameString, @".*?\bMac[A-Za-z^aciozj]{2,}\b") || Regex.IsMatch(nameString, @".*?\bMc"))
        {
            nameString = UpdateMac(nameString);
        }
        return nameString;
    }

    /// <summary>
    /// Updates irish Mac & Mc.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string UpdateMac(string nameString)
    {
        MatchCollection matches = Regex.Matches(nameString, @"\b(Ma?c)([A-Za-z]+)");
        if (matches.Count == 1 && matches[0].Groups.Count == 3)
        {
            string replacement = matches[0].Groups[1].Value;
            replacement += matches[0].Groups[2].Value.Substring(0, 1).ToUpper();
            replacement += matches[0].Groups[2].Value.Substring(1);
            nameString = nameString.Replace(matches[0].Groups[0].Value, replacement);

            // Now fix "Mac" exceptions
            foreach (var exception in _exceptions.Keys)
            {
                nameString = Regex.Replace(nameString, exception, _exceptions[exception]);
            }
        }
        return nameString;
    }

    /// <summary>
    /// Fix roman numeral names.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string UpdateRoman(string nameString)
    {
        MatchCollection matches = Regex.Matches(nameString, _romanRegex);
        if (matches.Count > 1)
        {
            foreach (Match match in matches)
            {
                if (!string.IsNullOrEmpty(match.Value))
                {
                    nameString = Regex.Replace(nameString, match.Value, x => x.ToString().ToUpper());
                }
            }
        }
        return nameString;
    }

    /// <summary>
    /// Fix Spanish conjunctions.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string FixConjunction(string nameString)
    {
        foreach (var conjunction in _conjunctions)
        {
            nameString = Regex.Replace(nameString, @"\b" + conjunction + @"\b", x => x.ToString().ToLower());
        }
        return nameString;
    }
}

Usage:

string name_cased = CIQNameCase.NameCase("McCarthy");

This is my test method, everything seems to pass OK:

[TestMethod]
public void Test_NameCase_1()
{
    string[] names =
    {
        "Keith", "Yuri's", "Leigh-Williams", "McCarthy",
        // Mac exceptions
        "Machin", "Machlin", "Machar", "Mackle", "Macklin", "Mackie", "Macquarie",
        "Machado", "Macevicius", "Maciulis", "Macias", "MacMurdo",
        // General
        "O'Callaghan", "St. John", "von Streit", "van Dyke", "Van", "ap Llwyd Dafydd",
        "al Fahd", "Al", "el Grecco", "ben Gurion", "Ben", "da Vinci", "di Caprio",
        "du Pont", "de Legate", "del Crond", "der Sind", "van der Post",
        "van den Thillart", "von Trapp", "la Poisson", "le Figaro", "Mack Knife",
        "Dougal MacDonald", "Ruiz y Picasso", "Dato e Iradier", "Mas i Gavarró",
        // Roman numerals
        "Henry VIII", "Louis III", "Louis XIV", "Charles II", "Fred XLIX",
        "Yusof bin Ishak",
    };

    foreach (string name in names)
    {
        string name_upper = name.ToUpper();
        string name_cased = CIQNameCase.NameCase(name_upper);
        Console.WriteLine(string.Format("name: {0} -> {1} -> {2}", name, name_upper, name_cased));
        Assert.IsTrue(name == name_cased);
    }
}
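For readers who want the gist of the table-driven approach without the full class, here is a minimal Python sketch. It copies only a handful of the rules from the `_replacements` table above, and the names `REPLACEMENTS`, `capitalize` and `name_case` are chosen for this illustration. It leaves out the Mac/Mc, Roman numeral and conjunction steps, so treat it as an outline rather than an equivalent port.

```python
import re

# A small subset of the particle rules from the C# table above.
# Each entry is (pattern, replacement); \1 refers to the captured vowel/letter.
REPLACEMENTS = [
    (r"\bAl(?=\s+\w)", "al"),    # al Arabic, but keep forename "Al" on its own
    (r"\bD([aeiou])\b", r"d\1"), # da, de, di, du, do
    (r"\bDe([lrn])\b", r"de\1"), # del, der, den
    (r"\bVan(?=\s+\w)", "van"),  # van, but keep forename "Van" on its own
    (r"\bVon\b", "von"),         # von
]


def capitalize(name: str) -> str:
    """Lowercase everything, then uppercase the first letter of each word."""
    name = name.lower()
    name = re.sub(r"\b\w", lambda m: m.group(0).upper(), name)
    # Undo the capital in a trailing 's (e.g. "Yuri'S" -> "Yuri's").
    return re.sub(r"'\w\b", lambda m: m.group(0).lower(), name)


def name_case(name: str) -> str:
    """Capitalize, then apply each particle rule in order."""
    name = capitalize(name)
    for pattern, replacement in REPLACEMENTS:
        name = re.sub(pattern, replacement, name)
    return name


print(name_case("VAN DER POST"))
print(name_case("YURI'S"))
```

Keeping the rules in a list of pairs means new particles can be added without touching the function bodies, which is the same design choice the dictionary in the C# version makes.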
I wrote this today to implement in an app I'm working on. I think the code is pretty self-explanatory with the comments. It's not 100% accurate in all cases, but it will handle most Western names easily.

Examples:

mary-jane => Mary-Jane
o'brien => O'Brien
Joël VON WINTEREGG => Joël von Winteregg
jose de la acosta => Jose de la Acosta

The code is extensible in that you may add any string value to the arrays at the top to suit your needs. Please study it and add any special feature that may be required.

function name_title_case($str)
{
    // name parts that should be lowercase in most cases
    $ok_to_be_lower = array('av','af','da','dal','de','del','der','di','la','le','van','der','den','vel','von');
    // name parts that should be lower even if at the beginning of a name
    $always_lower = array('van', 'der');

    $rules = array();
    $c13n = array();

    // Create an array from the parts of the string passed in
    $parts = explode(" ", mb_strtolower($str));
    foreach ($parts as $part) {
        (in_array($part, $ok_to_be_lower)) ? $rules[$part] = 'nocaps' : $rules[$part] = 'caps';
    }

    // Determine the first part in the string
    reset($rules);
    $first_part = key($rules);

    // Loop through and cap-or-dont-cap
    foreach ($rules as $part => $rule) {
        if ($rule == 'caps') {
            // ucfirst() words and also takes into account apostrophes and hyphens like this:
            // O'brien -> O'Brien || mary-kaye -> Mary-Kaye
            $part = str_replace('- ','-',ucwords(str_replace('-','- ', $part)));
            $c13n[] = str_replace('\' ', '\'', ucwords(str_replace('\'', '\' ', $part)));
        } else if ($part == $first_part && !in_array($part, $always_lower)) {
            // If the first part of the string is ok_to_be_lower, cap it anyway
            $c13n[] = ucfirst($part);
        } else {
            $c13n[] = $part;
        }
    }

    $titleized = implode(' ', $c13n);
    return trim($titleized);
}
What programming language do you use? Many languages allow callback functions for regular expression matches. These can be used to propercase the match easily. The regular expression is quite simple; you just have to match all word characters, like so:

/\w+/

Alternatively, you can capture the first character as a separate group:

/(\w)(\w*)/

Now you can access the first character and the remaining characters of the match separately. The callback function can then simply return a concatenation of the two groups. In Python this would look roughly like:

def make_proper(match): return match.group(1).upper() + match.group(2)

Incidentally, this would also handle the case of “O'Reilly”, because “O” and “Reilly” would be matched separately and both propercased. There are, however, other special cases that the algorithm does not handle well, e.g. “McDonald's” or generally any word with an apostrophe. The algorithm would produce “Mcdonald'S” for the latter. Special handling for apostrophes could be implemented, but that would interfere with the first case. A theoretically perfect solution isn't possible. In practice, it might help to consider the length of the part after the apostrophe.
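A small Python sketch of both ideas follows: the plain callback version, and a variant that looks at the length of the part after the apostrophe. The function names and the threshold `MIN_NAME_PART = 2` are assumptions made for this example; the answer only suggests that the length might help, not which cut-off to use.

```python
import re

# Assumed cut-off: parts after an apostrophe with at most this many letters
# (such as the "s" in "McDonald's") stay lowercase. Purely illustrative.
MIN_NAME_PART = 2


def make_proper(match: re.Match) -> str:
    """Callback: uppercase the first character, lowercase the rest."""
    return match.group(1).upper() + match.group(2).lower()


def proper_case(text: str) -> str:
    """Propercase every run of word characters independently."""
    return re.sub(r"(\w)(\w*)", make_proper, text)


def _fix_word(match: re.Match) -> str:
    """Callback for a whole word including its apostrophe-separated parts."""
    head, *tails = match.group(0).split("'")
    pieces = [head.capitalize()]
    for tail in tails:
        # Long tails look like name parts (O'Reilly); short ones like suffixes ('s).
        pieces.append(tail.capitalize() if len(tail) > MIN_NAME_PART else tail.lower())
    return "'".join(pieces)


def proper_case_apostrophe_aware(text: str) -> str:
    """Propercase words, using the length heuristic after each apostrophe."""
    return re.sub(r"\w+(?:'\w+)*", _fix_word, text)


print(proper_case("mcdonald's"))
print(proper_case_apostrophe_aware("mcdonald's"))
```

The heuristic is still a guess: a short surname part after an apostrophe would be left lowercase, so it trades one class of errors for another.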
Here's a perhaps naive C# implementation:

public class ProperCaseHelper
{
    public string ToProperCase(string input)
    {
        string ret = string.Empty;

        var words = input.Split(' ');

        for (int i = 0; i < words.Length; ++i)
        {
            ret += wordToProperCase(words[i]);
            if (i < words.Length - 1) ret += " ";
        }

        return ret;
    }

    private string wordToProperCase(string word)
    {
        if (string.IsNullOrEmpty(word)) return word;

        // Standard case
        string ret = capitaliseFirstLetter(word);

        // Special cases:
        ret = properSuffix(ret, "'");
        ret = properSuffix(ret, ".");
        ret = properSuffix(ret, "Mc");
        ret = properSuffix(ret, "Mac");

        return ret;
    }

    private string properSuffix(string word, string prefix)
    {
        if (string.IsNullOrEmpty(word)) return word;

        string lowerWord = word.ToLower(), lowerPrefix = prefix.ToLower();
        if (!lowerWord.Contains(lowerPrefix)) return word;

        int index = lowerWord.IndexOf(lowerPrefix);

        // If the search string is at the end of the word ignore.
        if (index + prefix.Length == word.Length) return word;

        return word.Substring(0, index) + prefix +
            capitaliseFirstLetter(word.Substring(index + prefix.Length));
    }

    private string capitaliseFirstLetter(string word)
    {
        return char.ToUpper(word[0]) + word.Substring(1).ToLower();
    }
}
I know this thread has been open for a while, but as I was doing research for this problem I came across this nifty site, which allows you to paste in names to be capitalized quite quickly: https://dialect.ca/code/name-case/. I wanted to include it here for reference for others doing similar research/projects. They release the algorithm they have written in PHP at this link: https://dialect.ca/code/name-case/name_case.phps A preliminary test and reading of their code suggests they have been quite thorough.
A simple way to capitalise the first letter of each word (separated by a space):

$result = "";
$words = explode(" ", $string);
for ($i=0; $i<count($words); $i++) {
    $s = strtolower($words[$i]);
    $s = substr_replace($s, strtoupper(substr($s, 0, 1)), 0, 1);
    $result .= "$s ";
}
$string = trim($result);

As for catching the "O'REILLY" example you gave: splitting the string on both spaces and ' would not work, as it would capitalise any letter that appears after an apostrophe, i.e. the s in Fred's. So I would probably try something like:

$result = "";
$words = explode(" ", $string);
for ($i=0; $i<count($words); $i++) {
    $s = strtolower($words[$i]);
    if (substr($s, 0, 2) === "o'") {
        $s = substr_replace($s, strtoupper(substr($s, 0, 3)), 0, 3);
    } else {
        $s = substr_replace($s, strtoupper(substr($s, 0, 1)), 0, 1);
    }
    $result .= "$s ";
}
$string = trim($result);

This should catch O'Reilly, O'Clock, O'Donnell etc. Hope it helps. Please note this code is untested.
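For comparison, the same "special-case the o' prefix" idea can be sketched in Python. This is a loose translation written for illustration (the function name `capitalise_words` is not from the answer), and like the PHP it only splits on single spaces.

```python
def capitalise_words(text: str) -> str:
    """Capitalise each space-separated word, treating an o' prefix specially."""
    words = []
    for word in text.split(" "):
        s = word.lower()
        if s.startswith("o'"):
            # Uppercase the O, the apostrophe (unchanged) and the next letter.
            s = s[:3].upper() + s[3:]
        else:
            s = s[:1].upper() + s[1:]
        words.append(s)
    return " ".join(words)


print(capitalise_words("john O'DONNELL"))
print(capitalise_words("fred's"))
```

Because only words that begin with o' get the extra capital, the s in "fred's" is left alone, which is the point of the second snippet above.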
Kronoz, thank you. I found that in your function the line

if (!lowerWord.Contains(lowerPrefix)) return word;

should read

if (!lowerWord.StartsWith(lowerPrefix)) return word;

so that "información" is not changed to "InforMacIón".

Best, Enrique
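To see the difference concretely, here is a minimal Python sketch of the prefix helper with a switch between the "contains" and "starts with" checks. The names `proper_prefix`, `capitalise_first` and the `anchored` flag are made up for this illustration; the logic otherwise follows the C# `properSuffix` method being discussed.

```python
def capitalise_first(word: str) -> str:
    """Uppercase the first character and lowercase the rest."""
    return word[:1].upper() + word[1:].lower()


def proper_prefix(word: str, prefix: str, anchored: bool) -> str:
    """Re-capitalise the text that follows `prefix` inside `word`.

    anchored=False mimics the Contains check, anchored=True the StartsWith check.
    """
    index = word.lower().find(prefix.lower())
    if index < 0 or (anchored and index != 0):
        return word
    end = index + len(prefix)
    if end == len(word):  # prefix sits at the very end: nothing to capitalise
        return word
    return word[:index] + prefix + capitalise_first(word[end:])


print(proper_prefix("Información", "Mac", anchored=False))  # the reported problem
print(proper_prefix("Información", "Mac", anchored=True))
print(proper_prefix("Macdonald", "Mac", anchored=True))
```

One thing to watch if the check is anchored for every call: the apostrophe and period rules match in the middle of a word, so something like "O'brien" would no longer be changed by them. Anchoring only the "Mc" and "Mac" calls avoids that.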
I use this as the TextChanged event handler of text boxes. It supports entry of names such as "McDonald".

Public Shared Function DoProperCaseConvert(ByVal str As String, Optional ByVal allowCapital As Boolean = True) As String
    Dim strCon As String = ""
    Dim wordbreak As String = " ,.1234567890;/\-()#$%^&*€!~+=#"
    Dim nextShouldBeCapital As Boolean = True

    'Improve to recognize all caps input
    'If str.Equals(str.ToUpper) Then
    '    str = str.ToLower
    'End If

    For Each s As Char In str.ToCharArray

        If allowCapital Then
            strCon = strCon & If(nextShouldBeCapital, s.ToString.ToUpper, s.ToString)
        Else
            strCon = strCon & If(nextShouldBeCapital, s.ToString.ToUpper, s.ToString.ToLower)
        End If

        If wordbreak.Contains(s.ToString) Then
            nextShouldBeCapital = True
        Else
            nextShouldBeCapital = False
        End If
    Next

    Return strCon
End Function
A lot of good answers here. Mine is pretty simple and only takes into account the names we have in our organization. You can expand it as you wish. This is not a perfect solution and will change vancouver to VanCouver, which is wrong. So tweak it if you use it. Here is my solution in C#. It hard-codes the names into the program, but with a little work you could keep a text file outside of the program, read in the name exceptions (i.e. Van, Mc, Mac) and loop through them.

public static String toProperName(String name)
{
    if (name != null)
    {
        // Changes mcdonald to "McDonald"
        if (name.Length >= 2 && name.ToLower().Substring(0, 2) == "mc")
            return "Mc" + Regex.Replace(name.ToLower().Substring(2), @"\b[a-z]", m => m.Value.ToUpper());

        // Changes vanwinkle to "VanWinkle"
        if (name.Length >= 3 && name.ToLower().Substring(0, 3) == "van")
            return "Van" + Regex.Replace(name.ToLower().Substring(3), @"\b[a-z]", m => m.Value.ToUpper());

        // Changes to title case but also fixes
        // apostrophes like O'HARE or o'hare to O'Hare
        return Regex.Replace(name.ToLower(), @"\b[a-z]", m => m.Value.ToUpper());
    }

    return "";
}
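The "read the exceptions from a file and loop through them" idea could be sketched like this in Python. The helper `parse_prefixes` and the one-prefix-per-line file layout are assumptions for the example; the answer does not specify a format.

```python
import re
from typing import Iterable, List, Optional


def parse_prefixes(lines: Iterable[str]) -> List[str]:
    """Turn the lines of a (hypothetical) exceptions file into a prefix list.

    Assumed layout: one prefix per line, blank lines ignored.
    """
    return [line.strip() for line in lines if line.strip()]


def _title(text: str) -> str:
    """Uppercase every a-z letter that starts a word."""
    return re.sub(r"\b[a-z]", lambda m: m.group(0).upper(), text)


def to_proper_name(name: Optional[str], prefixes: Iterable[str] = ("Mc", "Van")) -> str:
    """Title-case a name, re-capitalising after any configured prefix."""
    if name is None:
        return ""
    lowered = name.lower()
    for prefix in prefixes:
        if lowered.startswith(prefix.lower()):
            return prefix + _title(lowered[len(prefix):])
    return _title(lowered)


# With a real file this could be: parse_prefixes(open(path, encoding="utf-8"))
prefixes = parse_prefixes(["Mc\n", "\n", " Van \n"])
print(to_proper_name("mcdonald", prefixes))
print(to_proper_name("vancouver", prefixes))  # still wrong, as noted above
```

Moving the list out of the code does not fix the vancouver problem; it only makes it easier to tune which prefixes are applied for a given data set.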
You do not mention which language you would like the solution in, so here is some pseudo code.

Loop through each character
    If the previous character was an alphabet letter
        Make the character lower case
    Otherwise
        Make the character upper case
End loop
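The pseudo code above translates almost line for line into Python. This is only a sketch of that loop (the function name `proper_case_by_char` is invented here), and it inherits the loop's limitations: any letter after a non-letter is capitalised, including the s in "fred's".

```python
def proper_case_by_char(text: str) -> str:
    """Uppercase a character unless the previous character was a letter."""
    result = []
    previous_was_letter = False
    for ch in text:
        result.append(ch.lower() if previous_was_letter else ch.upper())
        previous_was_letter = ch.isalpha()
    return "".join(result)


print(proper_case_by_char("o'REILLY"))
print(proper_case_by_char("mary-jane smith"))
```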