What's the simplest algorithm to escape a single character?

I'm trying to write two functions escape(text, delimiter) and unescape(text, delimiter) with the following properties:
The result of escape does not contain delimiter.
unescape is the reverse of escape, i.e.
unescape(escape(text, delimiter), delimiter) == text
for all values of text and delimiter
It is OK to restrict the allowed values of delimiter.
Background: I want to create a delimiter-separated string of values. To be able to extract the same list out of the string again, I must ensure that the individual, separated strings do not contain the separator.
What I've tried: I came up with a simple solution (pseudo-code):
escape(text, delimiter): return text.Replace("\", "\\").Replace(delimiter, "\d")
unescape(text, delimiter): return text.Replace("\d", delimiter).Replace("\\", "\")
but discovered that property 2 failed on the test string "\d<delimiter>". Currently, I have the following working solution
escape(text, delimiter): return text.Replace("\", "\b").Replace(delimiter, "\d")
unescape(text, delimiter): return text.Replace("\d", delimiter).Replace("\b", "\")
which seems to work, as long as delimiter is not \, b or d (which is fine, I don't want to use those as delimiters anyway). However, since I have not formally proven its correctness, I'm afraid that I have missed some case where one of the properties is violated. Since this is such a common problem, I assume that there is already a "well-known proven-correct" algorithm for this, hence my question (see title).

Your first algorithm is correct.
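The error is in the implementation of unescape(): you need to replace both \d by the delimiter and \\ by \ in the same pass.
Several sequential calls to Replace() cannot do that: the first call also matches a \d that is really the tail of an escaped backslash followed by a literal d (as in \\d).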
Here's some sample C# code for safe quoting of delimiter-separated strings:
static string QuoteSeparator(string str,
char separator, char quoteChar, char otherChar) // "~" -> "~~" ";" -> "~s"
{
var sb = new StringBuilder(str.Length);
foreach (char c in str)
{
if (c == quoteChar)
{
sb.Append(quoteChar);
sb.Append(quoteChar);
}
else if (c == separator)
{
sb.Append(quoteChar);
sb.Append(otherChar);
}
else
{
sb.Append(c);
}
}
return sb.ToString(); // no separator in the result -> Join/Split is safe
}
static string UnquoteSeparator(string str,
char separator, char quoteChar, char otherChar) // "~~" -> "~" "~s" -> ";"
{
var sb = new StringBuilder(str.Length);
bool isQuoted = false;
foreach (char c in str)
{
if (isQuoted)
{
if (c == otherChar)
sb.Append(separator);
else
sb.Append(c);
isQuoted = false;
}
else
{
if (c == quoteChar)
isQuoted = true;
else
sb.Append(c);
}
}
if (isQuoted)
throw new ArgumentException("input string is not correctly quoted");
return sb.ToString(); // ";" are restored
}
/// <summary>
/// Encodes the given strings as a single string.
/// </summary>
/// <param name="input">The strings.</param>
/// <param name="separator">The separator.</param>
/// <param name="quoteChar">The quote char.</param>
/// <param name="otherChar">The other char.</param>
/// <returns></returns>
public static string QuoteAndJoin(this IEnumerable<string> input,
char separator = ';', char quoteChar = '~', char otherChar = 's')
{
CommonHelper.CheckNullReference(input, "input");
if (separator == quoteChar || quoteChar == otherChar || separator == otherChar)
throw new ArgumentException("cannot quote: ambiguous format");
return string.Join(new string(separator, 1), (from str in input select QuoteSeparator(str, separator, quoteChar, otherChar)).ToArray());
}
/// <summary>
/// Decodes the strings encoded in a single string.
/// </summary>
/// <param name="encoded">The encoded.</param>
/// <param name="separator">The separator.</param>
/// <param name="quoteChar">The quote char.</param>
/// <param name="otherChar">The other char.</param>
/// <returns></returns>
public static IEnumerable<string> SplitAndUnquote(this string encoded,
char separator = ';', char quoteChar = '~', char otherChar = 's')
{
CommonHelper.CheckNullReference(encoded, "encoded");
if (separator == quoteChar || quoteChar == otherChar || separator == otherChar)
throw new ArgumentException("cannot unquote: ambiguous format");
return from s in encoded.Split(separator) select UnquoteSeparator(s, separator, quoteChar, otherChar);
}
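The same single-pass idea can be sketched in Python. This is only an illustration under assumptions: the names escape, unescape, esc and code are made up here, every parameter is assumed to be a single character, and the scheme ("\\" for a backslash, "\d" for the delimiter) is the one from the question's first attempt.

```python
def escape(text, delimiter, esc="\\", code="d"):
    """Return text with every delimiter replaced, in one left-to-right pass."""
    # Assumption: delimiter, esc and code are single, mutually distinct characters.
    if len({delimiter, esc, code}) != 3:
        raise ValueError("delimiter, esc and code must be three different characters")
    out = []
    for ch in text:
        if ch == esc:
            out.append(esc + esc)      # "\" -> "\\"
        elif ch == delimiter:
            out.append(esc + code)     # delimiter -> "\d"
        else:
            out.append(ch)
    return "".join(out)


def unescape(text, delimiter, esc="\\", code="d"):
    """Invert escape(); both escape sequences are decoded in the same pass."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch != esc:
            out.append(ch)
            continue
        nxt = next(chars, None)        # consume the character after the escape
        if nxt is None:
            raise ValueError("input ends with a dangling escape character")
        out.append(delimiter if nxt == code else nxt)
    return "".join(out)


# The string that broke the two-Replace() version from the question
tricky = "\\d;"
encoded = escape(tricky, ";")
assert ";" not in encoded
assert unescape(encoded, ";") == tricky
```

Because each character of the input is looked at exactly once, an escaped backslash can never be mistaken for the start of an escaped delimiter.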

Maybe you could use an alternative replacement for the case when the delimiter is \, b or d. Use the same alternative replacement in the unescape algorithm as well.
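One way this suggestion might look in Python, as a rough sketch only: the helper pick_codes and its candidate list are invented for illustration, and the delimiter is assumed to be a single character. Both functions derive the same three characters from the delimiter, so the question's two-step Replace scheme stays reversible even when the delimiter is \, b or d.

```python
def pick_codes(delimiter, candidates="\\bd~#%"):
    """Pick an escape char and two code chars, none equal to the delimiter.

    Hypothetical helper: the candidate list is an arbitrary choice.
    """
    free = [c for c in candidates if c != delimiter]
    return free[0], free[1], free[2]


def escape(text, delimiter):
    esc, esc_code, delim_code = pick_codes(delimiter)
    # First hide the escape character itself, then the delimiter.
    return text.replace(esc, esc + esc_code).replace(delimiter, esc + delim_code)


def unescape(text, delimiter):
    esc, esc_code, delim_code = pick_codes(delimiter)
    # In escaped text the escape char is always followed by one of the two
    # codes, so these two replacements cannot interfere with each other.
    return text.replace(esc + delim_code, delimiter).replace(esc + esc_code, esc)


for d in ";\\bd":
    sample = "\\d" + d + "b"
    assert d not in escape(sample, d)
    assert unescape(escape(sample, d), d) == sample
```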

Related

Java (move point in string)

I have a string "abc" and a point ".".
How can I move this point through the string "abc"?
Example:
Input data = "abc".
Output data = "abc", "a.bc", "ab.c", "a.b.c".
Thanks.
public class MovePoint {
public static void main(String[] args) {
String str = "abcd";
String str1 = ".";
String[] ara = new String[str.length()];
for (int i = 0; i < str.length(); i++) {
ara[i] = str.substring(i, 1) + str1 + str.substring(1, 2);
System.out.print(Arrays.toString(ara));
}
}
}
Here is one way to do it. This uses StringBuilder as well as a plain char array to avoid having another loop over the array to build the last String, but it therefore consumes more memory.
First I print the first desired output, which is just the unmodified input String. Then I create a StringBuilder which can hold all chars from the input plus one more for the chosen separator, to avoid unnecessary array resizing. Then I initialize the StringBuilder so that it is in the form of the second desired output [char, sep, char, ...]. I am using StringBuilder here because it is more convenient, as it has the append() function that I need.
Last but not least I also initialize a char array which will hold the values for the last String to avoid having to iterate over the array twice to generate that.
Now I loop over the StringBuilder, starting from index one (as it is already initialized to the first result with separator) up to the last character. In this loop I do three things:
Print out the current value of StringBuilder
Swap the separator with the next character in the StringBuilder
Put the character and separator to the correct position in the char array as required for the last string
After the loop the last desired output is computed and I just have to print it to the console.
The loop runs O(n) iterations with constant-time work in each, not counting the cost of printing every intermediate string.
public static void main(String[] args) {
String str = "abcd";
char sep = '.';
movePoint(str, sep);
}
public static void movePoint(String str, char sep){
// print first desired output
System.out.println(str);
// String builder that can hold str.length + 1 characters, so no unnecessary resizing happens
var sb = new StringBuilder(str.length() + 1);
// fill with first char
sb.append(str.charAt(0));
// add separator
sb.append(sep);
// add rest of the string
sb.append(str.substring(1));
// Array that holds the last string
var lastStr = new char[str.length() + str.length() - 1];
for (int i = 1; i < sb.length() - 1; i++) {
System.out.println(sb);
// build current string
// swap separator with next character
var temp = sb.charAt(i);
sb.setCharAt(i, sb.charAt(i+1));
sb.setCharAt(i+1, temp);
// manipulate char array so last string is built correctly
int doubled = i << 1;
// set character at correct position
lastStr[doubled - 2] = sb.charAt(i-1);
// set separator at correct position
lastStr[doubled - 1] = sep;
}
// add last character of string to this char array
lastStr[lastStr.length - 1] = sb.charAt(sb.length() - 2);
// print last desired output
System.out.println(lastStr);
}
Expected output:
abcd
a.bcd
ab.cd
abc.d
a.b.c.d

Using backslashes in the windows cmd [duplicate]

Is the following behaviour some feature or a bug in C# .NET?
Test application:
using System;
using System.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Arguments:");
foreach (string arg in args)
{
Console.WriteLine(arg);
}
Console.WriteLine();
Console.WriteLine("Command Line:");
var clArgs = Environment.CommandLine.Split(' ');
foreach (string arg in clArgs.Skip(clArgs.Length - args.Length))
{
Console.WriteLine(arg);
}
Console.ReadKey();
}
}
}
Run it with command line arguments:
a "b" "\\x\\" "\x\"
In the result I receive:
Arguments:
a
b
\\x\
\x"
Command Line:
a
"b"
"\\x\\"
"\x\"
Backslashes are missing and one quote is not removed in the args passed to the Main() method. What is the correct workaround, other than manually parsing Environment.CommandLine?
According to this article by Jon Galloway, there can be weird behaviour experienced when using backslashes in command line arguments.
Most notably it mentions that "Most applications (including .NET applications) use CommandLineToArgvW to decode their command lines. It uses crazy escaping rules which explain the behaviour you're seeing."
It explains that the first set of backslashes do not require escaping, but backslashes coming after alpha (maybe numeric too?) characters require escaping and that quotes always need to be escaped.
Based off of these rules, I believe to get the arguments you want you would have to pass them as:
a "b" "\\x\\\\" "\x\\"
"Whacky" indeed.
The full story of the crazy escaping rules was told in 2011 by an MS blog entry: "Everyone quotes command line arguments the wrong way"
Raymond also had something to say on the matter (already back in 2010): "What's up with the strange treatment of quotation marks and backslashes by CommandLineToArgvW"
The situation persists: the escaping rules described in "Everyone quotes command line arguments the wrong way" are still correct as of 2020 and Windows 10.
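For illustration, here is a small Python sketch of quoting an argument the way that blog entry describes, as far as I understand the rules: backslashes are only special when they run up against a double quote. The function name quote_arg and its force flag are made up for this example, and it only covers CommandLineToArgvW-style parsing, not cmd.exe metacharacters.

```python
def quote_arg(arg, force=False):
    """Quote one argument so CommandLineToArgvW-style parsing yields it unchanged."""
    needs_quotes = force or arg == "" or any(c in ' \t\n\v"' for c in arg)
    if not needs_quotes:
        return arg
    out = ['"']
    backslashes = 0                     # length of the current run of backslashes
    for ch in arg:
        if ch == "\\":
            backslashes += 1
        elif ch == '"':
            # Double the pending backslashes, then add one more to escape the quote.
            out.append("\\" * (2 * backslashes + 1) + '"')
            backslashes = 0
        else:
            # Backslashes not followed by a quote are literal.
            out.append("\\" * backslashes + ch)
            backslashes = 0
    # Trailing backslashes would otherwise escape the closing quote: double them.
    out.append("\\" * (2 * backslashes) + '"')
    return "".join(out)


BS = "\\"
# The two problematic arguments from the question, forced into quotes
print(quote_arg(BS * 2 + "x" + BS * 2, force=True))
print(quote_arg(BS + "x" + BS, force=True))
```

For the two arguments from the question this produces the same quoted forms as the corrected command line given earlier in this answer thread.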
I came across this same issue the other day and had a tough time getting through it. In my googling, I came across this article regarding VB.NET (the language of my application) that solved the problem without having to change any of my other code based on the arguments.
In that article, he refers to the original article, which was written for C#. Here's the actual code; you pass it Environment.CommandLine:
C#
class CommandLineTools
{
/// <summary>
/// C-like argument parser
/// </summary>
/// <param name="commandLine">Command line string with arguments. Use Environment.CommandLine</param>
/// <returns>The args[] array (argv)</returns>
public static string[] CreateArgs(string commandLine)
{
StringBuilder argsBuilder = new StringBuilder(commandLine);
bool inQuote = false;
// Convert the spaces to a newline sign so we can split at newline later on
// Only convert spaces which are outside the boundaries of quoted text
for (int i = 0; i < argsBuilder.Length; i++)
{
if (argsBuilder[i].Equals('"'))
{
inQuote = !inQuote;
}
if (argsBuilder[i].Equals(' ') && !inQuote)
{
argsBuilder[i] = '\n';
}
}
// Split to args array
string[] args = argsBuilder.ToString().Split(new char[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
// Clean the '"' signs from the args as needed.
for (int i = 0; i < args.Length; i++)
{
args[i] = ClearQuotes(args[i]);
}
return args;
}
/// <summary>
/// Cleans quotes from the arguments.<br/>
/// All single quotes (") will be removed.<br/>
/// Every pair of quotes ("") will transform to a single quote.<br/>
/// </summary>
/// <param name="stringWithQuotes">A string with quotes.</param>
/// <returns>The same string if its without quotes, or a clean string if its with quotes.</returns>
private static string ClearQuotes(string stringWithQuotes)
{
int quoteIndex;
if ((quoteIndex = stringWithQuotes.IndexOf('"')) == -1)
{
// String is without quotes..
return stringWithQuotes;
}
// Linear sb scan is faster than string assignment if quote count is 2 or more (=always)
StringBuilder sb = new StringBuilder(stringWithQuotes);
for (int i = quoteIndex; i < sb.Length; i++)
{
if (sb[i].Equals('"'))
{
// If we are not at the last index and the next one is '"', we need to jump one to preserve one
if (i != sb.Length - 1 && sb[i + 1].Equals('"'))
{
i++;
}
// We remove and then set index one backwards.
// This is because the remove itself is going to shift everything left by 1.
sb.Remove(i--, 1);
}
}
return sb.ToString();
}
}
VB.NET:
Imports System.Text
' Original version by Jonathan Levison (C#)'
' http://sleepingbits.com/2010/01/command-line-arguments-with-double-quotes-in-net/
' converted using http://www.developerfusion.com/tools/convert/csharp-to-vb/
' and then some manual effort to fix language discrepancies
Friend Class CommandLineHelper
''' <summary>
''' C-like argument parser
''' </summary>
''' <param name="commandLine">Command line string with arguments. Use Environment.CommandLine</param>
''' <returns>The args[] array (argv)</returns>
Public Shared Function CreateArgs(commandLine As String) As String()
Dim argsBuilder As New StringBuilder(commandLine)
Dim inQuote As Boolean = False
' Convert the spaces to a newline sign so we can split at newline later on
' Only convert spaces which are outside the boundaries of quoted text
For i As Integer = 0 To argsBuilder.Length - 1
If argsBuilder(i).Equals(""""c) Then
inQuote = Not inQuote
End If
If argsBuilder(i).Equals(" "c) AndAlso Not inQuote Then
argsBuilder(i) = ControlChars.Lf
End If
Next
' Split to args array
Dim args As String() = argsBuilder.ToString().Split(New Char() {ControlChars.Lf}, StringSplitOptions.RemoveEmptyEntries)
' Clean the '"' signs from the args as needed.
For i As Integer = 0 To args.Length - 1
args(i) = ClearQuotes(args(i))
Next
Return args
End Function
''' <summary>
''' Cleans quotes from the arguments.<br/>
''' All single quotes (") will be removed.<br/>
''' Every pair of quotes ("") will transform to a single quote.<br/>
''' </summary>
''' <param name="stringWithQuotes">A string with quotes.</param>
''' <returns>The same string if its without quotes, or a clean string if its with quotes.</returns>
Private Shared Function ClearQuotes(stringWithQuotes As String) As String
Dim quoteIndex As Integer = stringWithQuotes.IndexOf(""""c)
If quoteIndex = -1 Then Return stringWithQuotes
' Linear sb scan is faster than string assignment if quote count is 2 or more (=always)
Dim sb As New StringBuilder(stringWithQuotes)
Dim i As Integer = quoteIndex
Do While i < sb.Length
If sb(i).Equals(""""c) Then
' If we are not at the last index and the next one is '"', we need to jump one to preserve one
If i <> sb.Length - 1 AndAlso sb(i + 1).Equals(""""c) Then
i += 1
End If
' We remove and then set index one backwards.
' This is because the remove itself is going to shift everything left by 1.
sb.Remove(i, 1)
i -= 1
End If
i += 1
Loop
Return sb.ToString()
End Function
End Class
I have escaped the problem the other way...
Instead of getting arguments already parsed I am getting the arguments string as it is and then I am using my own parser:
static void Main(string[] args)
{
var param = ParseString(Environment.CommandLine);
...
}
// The following template implements the following notation:
// -key1 = some value -key2 = "some value even with '-' character " ...
private const string ParameterQuery = "\\-(?<key>\\w+)\\s*=\\s*(\"(?<value>[^\"]*)\"|(?<value>[^\\-]*))\\s*";
private static Dictionary<string, string> ParseString(string value)
{
var regex = new Regex(ParameterQuery);
return regex.Matches(value).Cast<Match>().ToDictionary(m => m.Groups["key"].Value, m => m.Groups["value"].Value);
}
This concept lets you type quotes without the escape prefix.
After much experimentation this worked for me. I'm trying to create a command to send to the Windows command line. A folder name comes after the -graphical option in the command, and since it may have spaces in it, it has to be wrapped in double quotes. When I used backslashes to create the quotes they came out as literals in the command. So I used this instead:
string q = @"" + (char) 34;
string strCmdText = string.Format(@"/C cleartool update -graphical {1}{0}{1}", this.txtViewFolder.Text, q);
System.Diagnostics.Process.Start("CMD.exe", strCmdText);
q is a string holding just a double quote character. It's preceded with @ to make it a verbatim string literal.
The command template is also a verbatim string literal, and the string.Format method is used to compile everything into strCmdText.
This works for me, and it works correctly with the example in the question.
/// <summary>
/// https://www.pinvoke.net/default.aspx/shell32/CommandLineToArgvW.html
/// </summary>
/// <param name="unsplitArgumentLine"></param>
/// <returns></returns>
static string[] SplitArgs(string unsplitArgumentLine)
{
int numberOfArgs;
IntPtr ptrToSplitArgs;
string[] splitArgs;
ptrToSplitArgs = CommandLineToArgvW(unsplitArgumentLine, out numberOfArgs);
// CommandLineToArgvW returns NULL upon failure.
if (ptrToSplitArgs == IntPtr.Zero)
throw new ArgumentException("Unable to split argument.", new Win32Exception());
// Make sure the memory ptrToSplitArgs points to is freed, even upon failure.
try
{
splitArgs = new string[numberOfArgs];
// ptrToSplitArgs is an array of pointers to null terminated Unicode strings.
// Copy each of these strings into our split argument array.
for (int i = 0; i < numberOfArgs; i++)
splitArgs[i] = Marshal.PtrToStringUni(
Marshal.ReadIntPtr(ptrToSplitArgs, i * IntPtr.Size));
return splitArgs;
}
finally
{
// Free memory obtained by CommandLineToArgvW.
LocalFree(ptrToSplitArgs);
}
}
[DllImport("shell32.dll", SetLastError = true)]
static extern IntPtr CommandLineToArgvW(
[MarshalAs(UnmanagedType.LPWStr)] string lpCmdLine,
out int pNumArgs);
[DllImport("kernel32.dll")]
static extern IntPtr LocalFree(IntPtr hMem);
static string Reverse(string s)
{
char[] charArray = s.ToCharArray();
Array.Reverse(charArray);
return new string(charArray);
}
static string GetEscapedCommandLine()
{
StringBuilder sb = new StringBuilder();
bool gotQuote = false;
foreach (var c in Environment.CommandLine.Reverse())
{
if (c == '"')
gotQuote = true;
else if (gotQuote && c == '\\')
{
// double it
sb.Append('\\');
}
else
gotQuote = false;
sb.Append(c);
}
return Reverse(sb.ToString());
}
static void Main(string[] args)
{
// Crazy hack
args = SplitArgs(GetEscapedCommandLine()).Skip(1).ToArray();
}

Join a string using delimiters

What is the best way to join a list of strings into a combined delimited string? I'm mainly concerned about when to stop adding the delimiter. I'll use C# for my examples but I would like this to be language agnostic.
EDIT: I have not used StringBuilder to make the code slightly simpler.
Use a For Loop
for(int i=0; i < list.Length; i++)
{
result += list[i];
if(i != list.Length - 1)
result += delimiter;
}
Use a For Loop setting the first item previously
result = list[0];
for(int i = 1; i < list.Length; i++)
result += delimiter + list[i];
These won't work for an IEnumerable where you don't know the length of the list beforehand so
Using a foreach loop
bool first = true;
foreach(string item in list)
{
if(!first)
result += delimiter;
result += item;
first = false;
}
Variation on a foreach loop
From Jon's solution
StringBuilder builder = new StringBuilder();
string delimiter = "";
foreach (string item in list)
{
builder.Append(delimiter);
builder.Append(item);
delimiter = ",";
}
return builder.ToString();
Using an Iterator
Again from Jon
using (IEnumerator<string> iterator = list.GetEnumerator())
{
if (!iterator.MoveNext())
return "";
StringBuilder builder = new StringBuilder(iterator.Current);
while (iterator.MoveNext())
{
builder.Append(delimiter);
builder.Append(iterator.Current);
}
return builder.ToString();
}
What other algorithms are there?
It's impossible to give a truly language-agnostic answer here as different languages and platforms handle strings differently, and provide different levels of built-in support for joining lists of strings. You could take pretty much identical code in two different languages, and it would be great in one and awful in another.
In C#, you could use:
StringBuilder builder = new StringBuilder();
string delimiter = "";
foreach (string item in list)
{
builder.Append(delimiter);
builder.Append(item);
delimiter = ",";
}
return builder.ToString();
This will prepend a comma on all but the first item. Similar code would be good in Java too.
EDIT: Here's an alternative, a bit like Ian's later answer but working on a general IEnumerable<string>.
// Change to IEnumerator for the non-generic IEnumerable
using (IEnumerator<string> iterator = list.GetEnumerator())
{
if (!iterator.MoveNext())
{
return "";
}
StringBuilder builder = new StringBuilder(iterator.Current);
while (iterator.MoveNext())
{
builder.Append(delimiter);
builder.Append(iterator.Current);
}
return builder.ToString();
}
EDIT nearly 5 years after the original answer...
In .NET 4, string.Join was overloaded pretty significantly. There's an overload taking IEnumerable<T> which automatically calls ToString, and there's an overload for IEnumerable<string>. So you don't need the code above any more... for .NET, anyway.
In .NET, you can use the String.Join method:
string concatenated = String.Join(",", list.ToArray());
Using .NET Reflector, we can find out how it does it:
public static unsafe string Join(string separator, string[] value, int startIndex, int count)
{
if (separator == null)
{
separator = Empty;
}
if (value == null)
{
throw new ArgumentNullException("value");
}
if (startIndex < 0)
{
throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_StartIndex"));
}
if (count < 0)
{
throw new ArgumentOutOfRangeException("count", Environment.GetResourceString("ArgumentOutOfRange_NegativeCount"));
}
if (startIndex > (value.Length - count))
{
throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_IndexCountBuffer"));
}
if (count == 0)
{
return Empty;
}
int length = 0;
int num2 = (startIndex + count) - 1;
for (int i = startIndex; i <= num2; i++)
{
if (value[i] != null)
{
length += value[i].Length;
}
}
length += (count - 1) * separator.Length;
if ((length < 0) || ((length + 1) < 0))
{
throw new OutOfMemoryException();
}
if (length == 0)
{
return Empty;
}
string str = FastAllocateString(length);
fixed (char* chRef = &str.m_firstChar)
{
UnSafeCharBuffer buffer = new UnSafeCharBuffer(chRef, length);
buffer.AppendString(value[startIndex]);
for (int j = startIndex + 1; j <= num2; j++)
{
buffer.AppendString(separator);
buffer.AppendString(value[j]);
}
}
return str;
}
There's little reason to make it language-agnostic when some languages provide support for this in one line, e.g., Python's
",".join(sequence)
See the join documentation for more info.
For Python, be sure you have a list of strings, else ','.join(x) will fail.
For a safe method using 2.5+:
delimiter = '","'
delimiter.join(str(a) if a is not None else '' for a in list_object)
The "str(a) if a is not None else ''" handles None values; otherwise str() ends up turning them into 'None', which isn't nice ;) (A plain "if a" test would also blank out 0 and False.)
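A small sketch of such a None-safe join, in case it helps. The function name join_values is just for illustration; it tests "is not None" rather than truthiness so that values such as 0 are kept.

```python
def join_values(values, delimiter=","):
    """Join arbitrary values; None becomes an empty field, the rest go through str()."""
    return delimiter.join("" if v is None else str(v) for v in values)


print(join_values(["a", None, 0, 2.5]))
```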
In PHP, use implode():
$string = implode($delim, $array);
I'd always add the delimiter and then remove it at the end if necessary. This way, you're not executing an if statement for every iteration of the loop when you only care about doing the work once.
StringBuilder sb = new StringBuilder();
foreach(string item in list){
sb.Append(item);
sb.Append(delimiter);
}
if (list.Count > 0) {
sb.Remove(sb.Length - delimiter.Length, delimiter.Length);
}
I would express this recursively.
Check if the number of string arguments is 1. If it is, return it.
Otherwise recurse, but combine the first two arguments with the delimiter between them.
Example in Common Lisp:
(defun join (delimiter &rest strings)
(if (null (rest strings))
(first strings)
(apply #'join
delimiter
(concatenate 'string
(first strings)
delimiter
(second strings))
(cddr strings))))
The more idiomatic way is to use reduce, but this expands to almost exactly the same instructions as the above:
(defun join (delimiter &rest strings)
(reduce (lambda (a b)
(concatenate 'string a delimiter b))
strings))
List<string> aaa = new List<string>{ "aaa", "bbb", "ccc" };
string mm = ";";
return aaa.Aggregate((a, b) => a + mm + b);
and you get
aaa;bbb;ccc
lambda is pretty handy
In C# you can just use String.Join(separator,string_list)
The problem is that computer languages rarely have string booleans, that is, methods that are of type string that do anything useful. SQL Server at least has is[not]null and nullif, which when combined solve the delimiter problem, by the way: isnotnull(nullif(columnvalue, ""),"," + columnvalue))
The problem is that in languages there are booleans, and there are strings, and never the twain shall meet except in ugly coding forms, e.g.
concatstring = string1 + "," + string2;
if (fubar)
concatstring += string3
concatstring += string4 etc
I've tried mightily to avoid all this ugliness, playing comma games and concatenating with joins, but I'm still left with some of it, including SQL Server errors when I've missed one of the commas and a variable is empty.
Jonathan
Since you tagged this language agnostic,
This is how you would do it in python
# delimiter can be multichar like "| trlalala |"
delimiter = ";"
# sequence can be any list, or iterator/generator that returns list of strings
result = delimiter.join(sequence)
#result will NOT have ending delimiter
Edit: I see I got beaten to the answer by several people. Sorry for the duplication.
I think the best way to do something like that is (I'll use pseudo-code, so we'll make it truly language agnostic):
function concat(<array> list, <boolean> strict):
for i in list:
if the length of i is zero and strict is false:
continue;
if i is not the first element:
result = result + separator;
result = result + i;
return result;
The second argument to concat(), strict, is a flag that says whether empty strings are to be included in the concatenation or not.
I never append a final separator; on the other hand, if strict is false the resulting string is free of stuff like "A,B,,,F" (provided the separator is a comma) and instead reads "A,B,F".
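The pseudo-code above could be turned into Python roughly like this. It is a sketch with one deliberate difference, which is my own assumption about the intent: it tracks whether anything has been emitted yet instead of checking for the first element, so a skipped leading empty string does not leave a stray separator behind.

```python
def concat(items, separator=",", strict=True):
    """Join items with separator; when strict is False, empty strings are skipped."""
    pieces = []
    first = True                        # nothing has been emitted yet
    for item in items:
        if not strict and len(item) == 0:
            continue                    # drop empty strings in non-strict mode
        if not first:
            pieces.append(separator)
        pieces.append(item)
        first = False
    return "".join(pieces)


print(concat(["A", "B", "", "", "F"]))                # strict: keeps the empty fields
print(concat(["A", "B", "", "", "F"], strict=False))  # non-strict: drops them
```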
that's how python solves the problem:
','.join(list_of_strings)
I never could understand the need for 'algorithms' in trivial cases, though.
This is a working solution in C#; in Java, you can use a similar for-each loop over an iterator.
string result = string.Empty;
// use stringbuilder at some stage.
foreach (string item in list)
result += "," + item ;
result = result.Substring(1);
// output: "item,item,item"
If using .NET, you might want to use extension method so that you can do
list.ToString(",")
For details, check out Separator Delimited ToString for Array, List, Dictionary, Generic IEnumerable
// contains extension methods, it must be a static class.
public static class ExtensionMethod
{
// apply this extension to any generic IEnumerable object.
public static string ToString<T>(this IEnumerable<T> source,
string separator)
{
if (source == null)
throw new ArgumentException("source can not be null.");
if (string.IsNullOrEmpty(separator))
throw new ArgumentException("separator can not be null or empty.");
// A LINQ query to call ToString on each elements
// and constructs a string array.
string[] array =
(from s in source
select s.ToString()
).ToArray();
// utilise builtin string.Join to concate elements with
// customizable separator.
return string.Join(separator, array);
}
}
EDIT:For performance reasons, replace the concatenation code with string builder solution that mentioned within this thread.
Seen the Python answer like 3 times, but no Ruby?!?!?
The first part of the code declares a new array. Then you can just call the .join() method and pass the delimiter, and it will return a string with the delimiter in the middle. I believe the join method calls the .to_s method on each item before it concatenates.
["ID", "Description", "Active"].join(",")
>> "ID,Description,Active"
This can be very useful when combining meta-programming with database interaction.
Does anyone know if C# has something similar to this syntactic sugar?
In Java 8 we can use:
List<String> list = Arrays.asList(new String[] { "a", "b", "c" });
System.out.println(String.join(",", list)); //Output: a,b,c
To have a prefix and suffix we can do
StringJoiner joiner = new StringJoiner(",", "{", "}");
list.forEach(x -> joiner.add(x));
System.out.println(joiner.toString()); //Output: {a,b,c}
Prior to Java 8 you can do like Jon's answer
StringBuilder sb = new StringBuilder(prefix);
boolean and = false;
for (E e : iterable) {
if (and) {
sb.append(delimiter);
}
sb.append(e);
and = true;
}
sb.append(suffix);
In .NET, I would use the String.Join method if possible, which allows you to specify a separator and a string array. A list can be converted to an array with ToArray, but I don't know what the performance hit of that would be.
The three algorithms that you mention are what I would use (I like the second because it does not have an if statement in it, but if the length is not known I would use the third because it does not duplicate the code). The second will only work if the list is not empty, so that might take another if statement.
A fourth variant might be to put a separator in front of every element that is concatenated and then remove the first separator from the result.
If you do concatenate strings in a loop, note that for non trivial cases the use of a stringbuilder will vastly outperform repeated string concatenations.
You could write your own method AppendTostring(string, delimiter) that appends the delimiter if and only if the string is not empty. Then you just call that method in any loop without having to worry when to append and when not to append.
Edit: better yet of course to use some kind of StringBuffer in the method if available.
string result = "";
foreach(string item in list)
{
result += delimiter + item;
}
result = result.Substring(1);
Edit: Of course, you wouldn't use this or any one of your algorithms to concatenate strings. With C#/.NET, you'd probably use a StringBuilder:
StringBuilder sb = new StringBuilder();
foreach(string item in list)
{
sb.Append(delimiter);
sb.Append(item);
}
string result = sb.ToString(1, sb.Length-1);
And a variation of this solution:
StringBuilder sb = new StringBuilder(list[0]);
for (int i=1; i<list.Count; i++)
{
sb.Append(delimiter);
sb.Append(list[i]);
}
string result = sb.ToString();
Both solutions do not include any error checks.
From http://dogsblog.softwarehouse.co.zw/post/2009/02/11/IEnumerable-to-Comma-Separated-List-(and-more).aspx
A pet hate of mine when developing is making a list of comma separated ids, it is SO simple but always has ugly code.... Common solutions are to loop through and put a comma after each item then remove the last character, or to have an if statement to check if you are at the beginning or end of the list. Below is a solution you can use on any IEnumerable, i.e. a List, Array etc. It is also the most efficient way I can think of doing it as it relies on assignment which is better than editing a string or using an if.
public static class StringExtensions
{
public static string Splice<T>(IEnumerable<T> args, string delimiter)
{
StringBuilder sb = new StringBuilder();
string d = "";
foreach (T t in args)
{
sb.Append(d);
sb.Append(t.ToString());
d = delimiter;
}
return sb.ToString();
}
}
Now it can be used with any IEnumerable eg.
StringExtensions.Splice(billingTransactions.Select(t => t.id), ",")
to give us 31,32,35
For java a very complete answer has been given in this question or this question.
That is use StringUtils.join in Apache Commons
String result = StringUtils.join(list, ", ");
In Clojure, you could just use clojure.contrib.str-utils/str-join:
(str-join ", " list)
But for the actual algorithm:
(reduce (fn [res cur] (str res ", " cur)) list)
Groovy also has a String Object.join(String) method.
Java (from Jon's solution):
StringBuilder sb = new StringBuilder();
String delimiter = "";
for (String item : items) {
sb.append(delimiter).append(item);
delimiter = ", ";
}
return sb.toString();
Here is my humble try:
public static string JoinWithDelimiter(List<string> words, string delimiter){
string joinedString = "";
if (words.Count > 0)
{
// start with the first word, then put the delimiter in front of every later word
joinedString = words[0];
for (var i = 1; i < words.Count; i++){
joinedString += delimiter + words[i];
}
}
return joinedString;
}
Usage;
List<string> words = new List<string>(){"my", "name", "is", "Hari"};
Console.WriteLine(JoinWithDelimiter(words, " "));

Canonical way to parse the command line into arguments in plain C Windows API

In a Windows program, what is the canonical way to parse the command line obtained from GetCommandLine into multiple arguments, similar to the argv array in Unix? It seems that CommandLineToArgvW does this for a Unicode command line, but I can't find a non-Unicode equivalent. Should I be using Unicode or not? If not, how do I parse the command line?
Here is an implementation of CommandLineToArgvA that delegates the work to CommandLineToArgvW, MultiByteToWideChar and WideCharToMultiByte.
LPSTR* CommandLineToArgvA(LPSTR lpCmdLine, INT *pNumArgs)
{
int retval;
retval = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, lpCmdLine, -1, NULL, 0);
if (retval == 0) /* MultiByteToWideChar returns 0 on failure; SUCCEEDED() is meant for HRESULTs */
return NULL;
LPWSTR lpWideCharStr = (LPWSTR)malloc(retval * sizeof(WCHAR));
if (lpWideCharStr == NULL)
return NULL;
retval = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, lpCmdLine, -1, lpWideCharStr, retval);
if (retval == 0) /* 0 signals failure */
{
free(lpWideCharStr);
return NULL;
}
int numArgs;
LPWSTR* args;
args = CommandLineToArgvW(lpWideCharStr, &numArgs);
free(lpWideCharStr);
if (args == NULL)
return NULL;
int storage = numArgs * sizeof(LPSTR);
for (int i = 0; i < numArgs; ++ i)
{
BOOL lpUsedDefaultChar = FALSE;
retval = WideCharToMultiByte(CP_ACP, 0, args[i], -1, NULL, 0, NULL, &lpUsedDefaultChar);
if (retval == 0) /* WideCharToMultiByte returns 0 on failure */
{
LocalFree(args);
return NULL;
}
storage += retval;
}
LPSTR* result = (LPSTR*)LocalAlloc(LMEM_FIXED, storage);
if (result == NULL)
{
LocalFree(args);
return NULL;
}
int bufLen = storage - numArgs * sizeof(LPSTR);
LPSTR buffer = ((LPSTR)result) + numArgs * sizeof(LPSTR);
for (int i = 0; i < numArgs; ++ i)
{
assert(bufLen > 0);
BOOL lpUsedDefaultChar = FALSE;
retval = WideCharToMultiByte(CP_ACP, 0, args[i], -1, buffer, bufLen, NULL, &lpUsedDefaultChar);
if (retval == 0) /* 0 signals failure */
{
LocalFree(result);
LocalFree(args);
return NULL;
}
result[i] = buffer;
buffer += retval;
bufLen -= retval;
}
LocalFree(args);
*pNumArgs = numArgs;
return result;
}
Apparently you can use __argv outside main() to access the pre-parsed argument vector...
I followed the source for parse_cmdline (see "argv_parsing.cpp" in the latest SDK), modified it to match the paradigm and operation of CommandLineToArgvW, and developed the following. Note: instead of using LocalAlloc, per Microsoft recommendations (see https://msdn.microsoft.com/en-us/library/windows/desktop/aa366723(v=vs.85).aspx) I've substituted HeapAlloc. There is additionally one change in the SAL notation: I deviate slightly by stating _In_opt_ for lpCmdLine, as CommandLineToArgvW does allow this to be NULL, in which case it returns an argument list containing just the program name.
A final caveat: parse_cmdline parses the command line slightly differently from CommandLineToArgvW in one aspect only. Two double quote characters in a row while the state is 'in quote' mode are interpreted as an escaped double quote character. Both functions consume the first one and output the second one. The difference is that for CommandLineToArgvW there is a transition out of 'in quote' mode, while parse_cmdline remains in 'in quote' mode. This is properly reflected in the function below.
You would use the below function as follows:
int argc = 0;
LPSTR *argv = CommandLineToArgvA(GetCommandLineA(), &argc);
HeapFree(GetProcessHeap(), 0, argv);
LPSTR* CommandLineToArgvA(_In_opt_ LPCSTR lpCmdLine, _Out_ int *pNumArgs)
{
if (!pNumArgs)
{
SetLastError(ERROR_INVALID_PARAMETER);
return NULL;
}
*pNumArgs = 0;
/*follow CommandLinetoArgvW and if lpCmdLine is NULL return the path to the executable.
Use 'programname' so that we don't have to allocate MAX_PATH * sizeof(CHAR) for argv
every time. Since this is ANSI the return can't be greater than MAX_PATH (260
characters)*/
CHAR programname[MAX_PATH] = {};
/*pnlength = the length of the string that is copied to the buffer, in characters, not
including the terminating null character*/
DWORD pnlength = GetModuleFileNameA(NULL, programname, MAX_PATH);
if (pnlength == 0) //error getting program name
{
//GetModuleFileNameA will SetLastError
return NULL;
}
if (lpCmdLine == NULL || *lpCmdLine == '\0') //NULL or empty command line: return just the program name
{
/*In keeping with CommandLineToArgvW the caller should make a single call to HeapFree
to release the memory of argv. Allocate a single block of memory with space for two
pointers (representing argv[0] and argv[1]). argv[0] will contain a pointer to argv+2
where the actual program name will be stored. argv[1] will be nullptr per the C++
specifications for argv. Hence space required is the size of a LPSTR (char*) multiplied
by 2 [pointers] + the length of the program name (+1 for null terminating character)
multiplied by the sizeof CHAR. HeapAlloc is called with HEAP_GENERATE_EXCEPTIONS flag,
so if there is a failure on allocating memory an exception will be generated.*/
LPSTR *argv = static_cast<LPSTR*>(HeapAlloc(GetProcessHeap(),
HEAP_ZERO_MEMORY | HEAP_GENERATE_EXCEPTIONS,
(sizeof(LPSTR) * 2) + ((pnlength + 1) * sizeof(CHAR))));
memcpy(argv + 2, programname, pnlength+1); //add 1 for the terminating null character
argv[0] = reinterpret_cast<LPSTR>(argv + 2);
argv[1] = nullptr;
*pNumArgs = 1;
return argv;
}
/*We need to determine the number of arguments and the number of characters so that the
proper amount of memory can be allocated for argv. Our argument count starts at 1 as the
first "argument" is the program name even if there are no other arguments per specs.*/
int argc = 1;
int numchars = 0;
LPCSTR templpcl = lpCmdLine;
bool in_quotes = false; //'in quotes' mode is off (false) or on (true)
/*first scan the program name and copy it. The handling is much simpler than for other
arguments. Basically, whatever lies between the leading double-quote and next one, or a
terminal null character is simply accepted. Fancier handling is not required because the
program name must be a legal NTFS/HPFS file name. Note that the double-quote characters are
not copied.*/
do {
if (*templpcl == '"')
{
//don't add " to character count
in_quotes = !in_quotes;
templpcl++; //move to next character
continue;
}
++numchars; //count character
templpcl++; //move to next character
if (_ismbblead(*templpcl) != 0) //handle MBCS
{
++numchars;
templpcl++; //skip over trail byte
}
} while (*templpcl != '\0' && (in_quotes || (*templpcl != ' ' && *templpcl != '\t')));
//parsed first argument
if (*templpcl == '\0')
{
/*no more arguments, rewind and the next for statement will handle*/
templpcl--;
}
//loop through the remaining arguments
int slashcount = 0; //count of backslashes
bool countorcopychar = true; //count the character or not
for (;;)
{
if (*templpcl)
{
//next argument begins with next non-whitespace character
while (*templpcl == ' ' || *templpcl == '\t')
++templpcl;
}
if (*templpcl == '\0')
break; //end of arguments
++argc; //next argument - increment argument count
//loop through this argument
for (;;)
{
/*Rules:
2N backslashes + " ==> N backslashes and begin/end quote
2N + 1 backslashes + " ==> N backslashes + literal "
N backslashes ==> N backslashes*/
slashcount = 0;
countorcopychar = true;
while (*templpcl == '\\')
{
//count the number of backslashes for use below
++templpcl;
++slashcount;
}
if (*templpcl == '"')
{
//if 2N backslashes before, start/end quote, otherwise count.
if (slashcount % 2 == 0) //even number of backslashes
{
if (in_quotes && *(templpcl +1) == '"')
{
in_quotes = !in_quotes; //NB: parse_cmdline omits this line
templpcl++; //double quote inside quoted string
}
else
{
//skip first quote character and count second
countorcopychar = false;
in_quotes = !in_quotes;
}
}
slashcount /= 2;
}
//count slashes
while (slashcount--)
{
++numchars;
}
if (*templpcl == '\0' || (!in_quotes && (*templpcl == ' ' || *templpcl == '\t')))
{
//at the end of the argument - break
break;
}
if (countorcopychar)
{
if (_ismbblead(*templpcl) != 0) //should copy another character for MBCS
{
++templpcl; //skip over trail byte
++numchars;
}
++numchars;
}
++templpcl;
}
//add a count for the null-terminating character
++numchars;
}
/*allocate memory for argv. Allocate a single block of memory with space for argc number of
pointers. argv[0] will contain a pointer to argv+argc where the actual program name will be
stored. argv[argc] will be nullptr per the C++ specifications. Hence space required is the
size of a LPSTR (char*) multiplied by argc + 1 pointers + the number of characters counted
above multiplied by the sizeof CHAR. HeapAlloc is called with HEAP_GENERATE_EXCEPTIONS
flag, so if there is a failure on allocating memory an exception will be generated.*/
LPSTR *argv = static_cast<LPSTR*>(HeapAlloc(GetProcessHeap(),
HEAP_ZERO_MEMORY | HEAP_GENERATE_EXCEPTIONS,
(sizeof(LPSTR) * (argc+1)) + (numchars * sizeof(CHAR))));
//now loop through the commandline again and split out arguments
in_quotes = false;
templpcl = lpCmdLine;
argv[0] = reinterpret_cast<LPSTR>(argv + argc+1);
LPSTR tempargv = reinterpret_cast<LPSTR>(argv + argc+1);
do {
if (*templpcl == '"')
{
in_quotes = !in_quotes;
templpcl++; //move to next character
continue;
}
*tempargv++ = *templpcl;
templpcl++; //move to next character
if (_ismbblead(*templpcl) != 0) //should copy another character for MBCS
{
*tempargv++ = *templpcl; //copy second byte
templpcl++; //skip over trail byte
}
} while (*templpcl != '\0' && (in_quotes || (*templpcl != ' ' && *templpcl != '\t')));
//parsed first argument
if (*templpcl == '\0')
{
//no more arguments, rewind and the next for statement will handle
templpcl--;
}
else
{
//end of program name - add null terminator
*tempargv = '\0';
}
int currentarg = 1;
argv[currentarg] = ++tempargv;
//loop through the remaining arguments
slashcount = 0; //count of backslashes
countorcopychar = true; //count the character or not
for (;;)
{
if (*templpcl)
{
//next argument begins with next non-whitespace character
while (*templpcl == ' ' || *templpcl == '\t')
++templpcl;
}
if (*templpcl == '\0')
break; //end of arguments
argv[currentarg] = ++tempargv; //copy address of this argument string
//next argument - loop through its characters
for (;;)
{
/*Rules:
2N backslashes + " ==> N backslashes and begin/end quote
2N + 1 backslashes + " ==> N backslashes + literal "
N backslashes ==> N backslashes*/
slashcount = 0;
countorcopychar = true;
while (*templpcl == '\\')
{
//count the number of backslashes for use below
++templpcl;
++slashcount;
}
if (*templpcl == '"')
{
//if 2N backslashes before, start/end quote, otherwise copy literally.
if (slashcount % 2 == 0) //even number of backslashes
{
if (in_quotes && *(templpcl+1) == '"')
{
in_quotes = !in_quotes; //NB: parse_cmdline omits this line
templpcl++; //double quote inside quoted string
}
else
{
//skip first quote character and count second
countorcopychar = false;
in_quotes = !in_quotes;
}
}
slashcount /= 2;
}
//copy slashes
while (slashcount--)
{
*tempargv++ = '\\';
}
if (*templpcl == '\0' || (!in_quotes && (*templpcl == ' ' || *templpcl == '\t')))
{
//at the end of the argument - break
break;
}
if (countorcopychar)
{
*tempargv++ = *templpcl;
if (_ismbblead(*templpcl) != 0) //should copy another character for MBCS
{
++templpcl; //skip over trail byte
*tempargv++ = *templpcl;
}
}
++templpcl;
}
//null-terminate the argument
*tempargv = '\0';
++currentarg;
}
argv[argc] = nullptr;
*pNumArgs = argc;
return argv;
}
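The three backslash rules quoted in the comments above (2N backslashes before a quote, 2N+1 backslashes before a quote, and backslashes not followed by a quote) can be tried out in isolation. The following Python sketch is a simplified illustration, not a port of the function above: the name `split_args` is hypothetical, the first token gets no special program-name handling, and the two-consecutive-quotes case is deliberately left out.

```python
def split_args(cmdline):
    """Split a command line using only the backslash/quote rules.

    Assumptions of this sketch: no special handling of the program name,
    and no handling of two consecutive quotes inside a quoted run.
    """
    args = []
    i, n = 0, len(cmdline)
    while True:
        # skip whitespace between arguments
        while i < n and cmdline[i] in " \t":
            i += 1
        if i >= n:
            break
        current = []
        in_quotes = False
        while i < n:
            # count a run of backslashes
            slashes = 0
            while i < n and cmdline[i] == "\\":
                slashes += 1
                i += 1
            if i < n and cmdline[i] == '"':
                current.append("\\" * (slashes // 2))
                if slashes % 2 == 0:
                    in_quotes = not in_quotes  # 2N backslashes: quote toggles mode
                else:
                    current.append('"')  # 2N+1 backslashes: literal quote
                i += 1
                continue
            current.append("\\" * slashes)  # not before a quote: kept as-is
            if i >= n or (not in_quotes and cmdline[i] in " \t"):
                break  # end of this argument
            current.append(cmdline[i])
            i += 1
        args.append("".join(current))
    return args
```

For example, the input `"a b" c\\d "e\"f"` splits into `a b`, `c\\d` and `e"f`: the backslashes in the second argument are untouched because no quote follows them.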
CommandLineToArgvW() is in shell32.dll. I'd guess that the Shell developers created the function for their own use, and it was made public either because someone decided that 3rd party devs would find it useful or because some court action made them do it.
Since the Shell developers only ever needed a Unicode version, that's all they ever wrote. It would be fairly simple to write an ANSI wrapper for the function that converts the ANSI to Unicode, calls the function and converts the Unicode results to ANSI (and if Shell32.dll ever provided an ANSI variant of this API, that's probably exactly what it would do).
None of these solved the problem perfectly for me when I did not want to parse Unicode, so my solution is adapted from the Wine project, which contains source code for shell32.dll's CommandLineToArgvW. I modified it to the version below, and it works perfectly for me:
/*************************************************************************
* CommandLineToArgvA [SHELL32.#]
*
* MODIFIED FROM https://www.winehq.org/ project
* We must interpret the quotes in the command line to rebuild the argv
* array correctly:
* - arguments are separated by spaces or tabs
* - quotes serve as optional argument delimiters
* '"a b"' -> 'a b'
* - escaped quotes must be converted back to '"'
* '\"' -> '"'
* - consecutive backslashes preceding a quote see their number halved with
* the remainder escaping the quote:
* 2n backslashes + quote -> n backslashes + quote as an argument delimiter
* 2n+1 backslashes + quote -> n backslashes + literal quote
* - backslashes that are not followed by a quote are copied literally:
* 'a\b' -> 'a\b'
* 'a\\b' -> 'a\\b'
* - in quoted strings, consecutive quotes see their number divided by three
* with the remainder modulo 3 deciding whether to close the string or not.
* Note that the opening quote must be counted in the consecutive quotes,
* that's the (1+) below:
* (1+) 3n quotes -> n quotes
* (1+) 3n+1 quotes -> n quotes plus closes the quoted string
* (1+) 3n+2 quotes -> n+1 quotes plus closes the quoted string
* - in unquoted strings, the first quote opens the quoted string and the
* remaining consecutive quotes follow the above rule.
*/
LPSTR* WINAPI CommandLineToArgvA(LPSTR lpCmdline, int* numargs)
{
DWORD argc;
LPSTR *argv;
LPSTR s;
LPSTR d;
LPSTR cmdline;
int qcount,bcount;
if(!numargs || *lpCmdline==0)
{
SetLastError(ERROR_INVALID_PARAMETER);
return NULL;
}
/* --- First count the arguments */
argc=1;
s=lpCmdline;
/* The first argument, the executable path, follows special rules */
if (*s=='"')
{
/* The executable path ends at the next quote, no matter what */
s++;
while (*s)
if (*s++=='"')
break;
}
else
{
/* The executable path ends at the next space, no matter what */
while (*s && *s!=' ' && *s!='\t')
s++;
}
/* skip to the first argument, if any */
while (*s==' ' || *s=='\t')
s++;
if (*s)
argc++;
/* Analyze the remaining arguments */
qcount=bcount=0;
while (*s)
{
if ((*s==' ' || *s=='\t') && qcount==0)
{
/* skip to the next argument and count it if any */
while (*s==' ' || *s=='\t')
s++;
if (*s)
argc++;
bcount=0;
}
else if (*s=='\\')
{
/* '\', count them */
bcount++;
s++;
}
else if (*s=='"')
{
/* '"' */
if ((bcount & 1)==0)
qcount++; /* unescaped '"' */
s++;
bcount=0;
/* consecutive quotes, see comment in copying code below */
while (*s=='"')
{
qcount++;
s++;
}
qcount=qcount % 3;
if (qcount==2)
qcount=0;
}
else
{
/* a regular character */
bcount=0;
s++;
}
}
/* Allocate in a single lump, the string array, and the strings that go
* with it. This way the caller can make a single LocalFree() call to free
* both, as per MSDN.
*/
argv=LocalAlloc(LMEM_FIXED, (argc+1)*sizeof(LPSTR)+(strlen(lpCmdline)+1)*sizeof(char));
if (!argv)
return NULL;
cmdline=(LPSTR)(argv+argc+1);
strcpy(cmdline, lpCmdline);
/* --- Then split and copy the arguments */
argv[0]=d=cmdline;
argc=1;
/* The first argument, the executable path, follows special rules */
if (*d=='"')
{
/* The executable path ends at the next quote, no matter what */
s=d+1;
while (*s)
{
if (*s=='"')
{
s++;
break;
}
*d++=*s++;
}
}
else
{
/* The executable path ends at the next space, no matter what */
while (*d && *d!=' ' && *d!='\t')
d++;
s=d;
if (*s)
s++;
}
/* close the executable path */
*d++=0;
/* skip to the first argument and initialize it if any */
while (*s==' ' || *s=='\t')
s++;
if (!*s)
{
/* There are no parameters so we are all done */
argv[argc]=NULL;
*numargs=argc;
return argv;
}
/* Split and copy the remaining arguments */
argv[argc++]=d;
qcount=bcount=0;
while (*s)
{
if ((*s==' ' || *s=='\t') && qcount==0)
{
/* close the argument */
*d++=0;
bcount=0;
/* skip to the next one and initialize it if any */
do {
s++;
} while (*s==' ' || *s=='\t');
if (*s)
argv[argc++]=d;
}
else if (*s=='\\')
{
*d++=*s++;
bcount++;
}
else if (*s=='"')
{
if ((bcount & 1)==0)
{
/* Preceded by an even number of '\', this is half that
* number of '\', plus a quote which we erase.
*/
d-=bcount/2;
qcount++;
}
else
{
/* Preceded by an odd number of '\', this is half that
* number of '\' followed by a '"'
*/
d=d-bcount/2-1;
*d++='"';
}
s++;
bcount=0;
/* Now count the number of consecutive quotes. Note that qcount
* already takes into account the opening quote if any, as well as
* the quote that lead us here.
*/
while (*s=='"')
{
if (++qcount==3)
{
*d++='"';
qcount=0;
}
s++;
}
if (qcount==2)
qcount=0;
}
else
{
/* a regular character */
*d++=*s++;
bcount=0;
}
}
*d='\0';
argv[argc]=NULL;
*numargs=argc;
return argv;
}
Be careful when parsing an empty string "": this function returns NULL instead of the executable path, which differs from the behavior of the standard CommandLineToArgvW. The recommended usage is below:
int argc;
LPSTR * argv = CommandLineToArgvA(GetCommandLineA(), &argc);
// after argv has been consumed
LocalFree(argv);
The following is about the simplest way I can think of to obtain an old-fashioned argc/argv pair at the top of WinMain. Assuming that the command-line really was ANSI text, you don't actually need any conversions fancier than this.
int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd) {
int argc;
LPWSTR *szArglist = CommandLineToArgvW(GetCommandLineW(), &argc);
char **argv = new char*[argc];
for (int i=0; i<argc; i++) {
int lgth = wcslen(szArglist[i]);
argv[i] = new char[lgth+1];
for (int j=0; j<=lgth; j++)
argv[i][j] = char(szArglist[i][j]);
}
LocalFree(szArglist);

Does anyone have a good Proper Case algorithm

Does anyone have a trusted Proper Case or PCase algorithm (similar to a UCase or Upper)? I'm looking for something that takes a value such as "GEORGE BURDELL" or "george burdell" and turns it into "George Burdell".
I have a simple one that handles the simple cases. The ideal would be to have something that can handle things such as "O'REILLY" and turn it into "O'Reilly", but I know that is tougher.
I am mainly focused on the English language if that simplifies things.
UPDATE: I'm using C# as the language, but I can convert from almost anything (assuming like functionality exists).
I agree that the McDonald's scenario is a tough one. I meant to mention that along with my O'Reilly example, but did not in the original post.
Unless I've misunderstood your question I don't think you need to roll your own, the TextInfo class can do it for you.
using System.Globalization;
CultureInfo.InvariantCulture.TextInfo.ToTitleCase("GeOrGE bUrdEll")
Will return "George Burdell". And you can use your own culture if there are some special rules involved.
Update: Michael (in a comment to this answer) pointed out that this will not work if the input is all caps since the method will assume that it is an acronym. The naive workaround for this is to .ToLower() the text before submitting it to ToTitleCase.
@zwol: I'll post it as a separate reply.
Here's an example based on ljs's post.
void Main()
{
List<string> names = new List<string>() {
"bill o'reilly",
"johannes diderik van der waals",
"mr. moseley-williams",
"Joe VanWyck",
"mcdonald's",
"william the third",
"hrh prince charles",
"h.r.m. queen elizabeth the third",
"william gates, iii",
"pope leo xii",
"a.k. jennings"
};
names.Select(name => name.ToProperCase()).Dump();
}
// http://stackoverflow.com/questions/32149/does-anyone-have-a-good-proper-case-algorithm
public static class ProperCaseHelper
{
public static string ToProperCase(this string input)
{
if (IsAllUpperOrAllLower(input))
{
// fix the ALL UPPERCASE or all lowercase names
return string.Join(" ", input.Split(' ').Select(word => wordToProperCase(word)));
}
else
{
// leave the CamelCase or Propercase names alone
return input;
}
}
public static bool IsAllUpperOrAllLower(this string input)
{
return (input.ToLower().Equals(input) || input.ToUpper().Equals(input));
}
private static string wordToProperCase(string word)
{
if (string.IsNullOrEmpty(word)) return word;
// Standard case
string ret = capitaliseFirstLetter(word);
// Special cases:
ret = properSuffix(ret, "'"); // D'Artagnon, D'Silva
ret = properSuffix(ret, "."); // ???
ret = properSuffix(ret, "-"); // Oscar-Meyer-Weiner
ret = properSuffix(ret, "Mc", t => t.Length > 4); // Scots
ret = properSuffix(ret, "Mac", t => t.Length > 5); // Scots except Macey
// Special words:
ret = specialWords(ret, "van"); // Dick van Dyke
ret = specialWords(ret, "von"); // Baron von Bruin-Valt
ret = specialWords(ret, "de");
ret = specialWords(ret, "di");
ret = specialWords(ret, "da"); // Leonardo da Vinci, Eduardo da Silva
ret = specialWords(ret, "of"); // The Grand Old Duke of York
ret = specialWords(ret, "the"); // William the Conqueror
ret = specialWords(ret, "HRH"); // His/Her Royal Highness
ret = specialWords(ret, "HRM"); // His/Her Royal Majesty
ret = specialWords(ret, "H.R.H."); // His/Her Royal Highness
ret = specialWords(ret, "H.R.M."); // His/Her Royal Majesty
ret = dealWithRomanNumerals(ret); // William Gates, III
return ret;
}
private static string properSuffix(string word, string prefix, Func<string, bool> condition = null)
{
if (string.IsNullOrEmpty(word)) return word;
if (condition != null && ! condition(word)) return word;
string lowerWord = word.ToLower();
string lowerPrefix = prefix.ToLower();
if (!lowerWord.Contains(lowerPrefix)) return word;
int index = lowerWord.IndexOf(lowerPrefix);
// If the search string is at the end of the word ignore.
if (index + prefix.Length == word.Length) return word;
return word.Substring(0, index) + prefix +
capitaliseFirstLetter(word.Substring(index + prefix.Length));
}
private static string specialWords(string word, string specialWord)
{
if (word.Equals(specialWord, StringComparison.InvariantCultureIgnoreCase))
{
return specialWord;
}
else
{
return word;
}
}
private static string dealWithRomanNumerals(string word)
{
// Roman Numeral parser thanks to [djk](https://stackoverflow.com/users/785111/djk)
// Note that it excludes the Chinese last name Xi
return new Regex(@"\b(?!Xi\b)(X|XX|XXX|XL|L|LX|LXX|LXXX|XC|C)?(I|II|III|IV|V|VI|VII|VIII|IX)?\b", RegexOptions.IgnoreCase).Replace(word, match => match.Value.ToUpperInvariant());
}
private static string capitaliseFirstLetter(string word)
{
return char.ToUpper(word[0]) + word.Substring(1).ToLower();
}
}
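The Roman-numeral step in the helper above is a regex replacement with a callback that uppercases each match. A rough Python sketch of that one step follows. The names `ROMAN` and `upper_roman` are hypothetical, and the pattern is an assumption of this sketch that only covers I through XXXIX, which is narrower than the C# pattern; like the original, it skips "Xi" so the surname is left alone.

```python
import re

# Hypothetical pattern: numerals I..XXXIX only, with "Xi" excluded as a surname.
# The (?=[ivx]) lookahead avoids pointless attempts at non-numeral words.
ROMAN = re.compile(r"\b(?!xi\b)(?=[ivx])x{0,3}(?:ix|iv|v?i{0,3})\b", re.IGNORECASE)


def upper_roman(name):
    """Uppercase any whole word that looks like a small Roman numeral."""
    return ROMAN.sub(lambda m: m.group(0).upper(), name)
```

Words that merely start with numeral letters, such as "ivan", are left unchanged because the closing `\b` requires the numeral to be the whole word. Short names that happen to be valid numerals (for instance "Vi") would still be uppercased, so a real implementation needs an exception list.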
There's also this neat Perl script for title-casing text.
http://daringfireball.net/2008/08/title_case_update
#!/usr/bin/perl
# This filter changes all words to Title Caps, and attempts to be clever
# about *un*capitalizing small words like a/an/the in the input.
#
# The list of "small words" which are not capped comes from
# the New York Times Manual of Style, plus 'vs' and 'v'.
#
# 10 May 2008
# Original version by John Gruber:
# http://daringfireball.net/2008/05/title_case
#
# 28 July 2008
# Re-written and much improved by Aristotle Pagaltzis:
# http://plasmasturm.org/code/titlecase/
#
# Full change log at __END__.
#
# License: http://www.opensource.org/licenses/mit-license.php
#
use strict;
use warnings;
use utf8;
use open qw( :encoding(UTF-8) :std );
my @small_words = qw( (?<!q&)a an and as at(?!&t) but by en for if in of on or the to v[.]? via vs[.]? );
my $small_re = join '|', @small_words;
my $apos = qr/ (?: ['’] [[:lower:]]* )? /x;
while ( <> ) {
s{\A\s+}{}, s{\s+\z}{};
$_ = lc $_ if not /[[:lower:]]/;
s{
\b (_*) (?:
( (?<=[ ][/\\]) [[:alpha:]]+ [-_[:alpha:]/\\]+ | # file path or
[-_[:alpha:]]+ [@.:] [-_[:alpha:]@.:/]+ $apos ) # URL, domain, or email
|
( (?i: $small_re ) $apos ) # or small word (case-insensitive)
|
( [[:alpha:]] [[:lower:]'’()\[\]{}]* $apos ) # or word w/o internal caps
|
( [[:alpha:]] [[:alpha:]'’()\[\]{}]* $apos ) # or some other word
) (_*) \b
}{
$1 . (
defined $2 ? $2 # preserve URL, domain, or email
: defined $3 ? "\L$3" # lowercase small word
: defined $4 ? "\u\L$4" # capitalize word w/o internal caps
: $5 # preserve other kinds of word
) . $6
}xeg;
# Exceptions for small words: capitalize at start and end of title
s{
( \A [[:punct:]]* # start of title...
| [:.;?!][ ]+ # or of subsentence...
| [ ]['"“‘(\[][ ]* ) # or of inserted subphrase...
( $small_re ) \b # ... followed by small word
}{$1\u\L$2}xig;
s{
\b ( $small_re ) # small word...
(?= [[:punct:]]* \Z # ... at the end of the title...
| ['"’”)\]] [ ] ) # ... or of an inserted subphrase?
}{\u\L$1}xig;
# Exceptions for small words in hyphenated compound words
## e.g. "in-flight" -> In-Flight
s{
\b
(?<! -) # Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (in-flight)
( $small_re )
(?= -[[:alpha:]]+) # lookahead for "-someword"
}{\u\L$1}xig;
## # e.g. "Stand-in" -> "Stand-In" (Stand is already capped at this point)
s{
\b
(?<!…) # Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (stand-in)
( [[:alpha:]]+- ) # $1 = first word and hyphen, should already be properly capped
( $small_re ) # ... followed by small word
(?! - ) # Negative lookahead for another '-'
}{$1\u$2}xig;
print "$_";
}
__END__
But it sounds like by proper case you mean proper-casing people's names only.
I did a quick C# port of https://github.com/tamtamchik/namecase, which is based on Lingua::EN::NameCase.
public static class CIQNameCase
{
static Dictionary<string, string> _exceptions = new Dictionary<string, string>
{
{@"\bMacEdo" ,"Macedo"},
{@"\bMacEvicius" ,"Macevicius"},
{@"\bMacHado" ,"Machado"},
{@"\bMacHar" ,"Machar"},
{@"\bMacHin" ,"Machin"},
{@"\bMacHlin" ,"Machlin"},
{@"\bMacIas" ,"Macias"},
{@"\bMacIulis" ,"Maciulis"},
{@"\bMacKie" ,"Mackie"},
{@"\bMacKle" ,"Mackle"},
{@"\bMacKlin" ,"Macklin"},
{@"\bMacKmin" ,"Mackmin"},
{@"\bMacQuarie" ,"Macquarie"}
};
static Dictionary<string, string> _replacements = new Dictionary<string, string>
{
{@"\bAl(?=\s+\w)" , @"al"}, // al Arabic or forename Al.
{@"\b(Bin|Binti|Binte)\b" , @"bin"}, // bin, binti, binte Arabic
{@"\bAp\b" , @"ap"}, // ap Welsh.
{@"\bBen(?=\s+\w)" , @"ben"}, // ben Hebrew or forename Ben.
{@"\bDell([ae])\b" , @"dell$1"}, // della and delle Italian.
{@"\bD([aeiou])\b" , @"d$1"}, // da, de, di Italian; du French; do Brasil
{@"\bD([ao]s)\b" , @"d$1"}, // das, dos Brasileiros
{@"\bDe([lrn])\b" , @"de$1"}, // del Italian; der/den Dutch/Flemish.
{@"\bEl\b" , @"el"}, // el Greek or El Spanish.
{@"\bLa\b" , @"la"}, // la French or La Spanish.
{@"\bL([eo])\b" , @"l$1"}, // lo Italian; le French.
{@"\bVan(?=\s+\w)" , @"van"}, // van German or forename Van.
{@"\bVon\b" , @"von"} // von Dutch/Flemish
};
static string[] _conjunctions = { "Y", "E", "I" };
static string _romanRegex = @"\b((?:[Xx]{1,3}|[Xx][Ll]|[Ll][Xx]{0,3})?(?:[Ii]{1,3}|[Ii][VvXx]|[Vv][Ii]{0,3})?)\b";
/// <summary>
/// Case a name field into its appropriate case format
/// e.g. Smith, de la Cruz, Mary-Jane, O'Brien, McTaggart
/// </summary>
/// <param name="nameString"></param>
/// <returns></returns>
public static string NameCase(string nameString)
{
// Capitalize
nameString = Capitalize(nameString);
nameString = UpdateIrish(nameString);
// Fixes for "son (daughter) of" etc
foreach (var replacement in _replacements.Keys)
{
if (Regex.IsMatch(nameString, replacement))
{
Regex rgx = new Regex(replacement);
nameString = rgx.Replace(nameString, _replacements[replacement]);
}
}
nameString = UpdateRoman(nameString);
nameString = FixConjunction(nameString);
return nameString;
}
/// <summary>
/// Capitalize first letters.
/// </summary>
/// <param name="nameString"></param>
/// <returns></returns>
private static string Capitalize(string nameString)
{
nameString = nameString.ToLower();
nameString = Regex.Replace(nameString, @"\b\w", x => x.ToString().ToUpper());
nameString = Regex.Replace(nameString, @"'\w\b", x => x.ToString().ToLower()); // Lowercase 's
return nameString;
}
/// <summary>
/// Update for Irish names.
/// </summary>
/// <param name="nameString"></param>
/// <returns></returns>
private static string UpdateIrish(string nameString)
{
if(Regex.IsMatch(nameString, @".*?\bMac[A-Za-z^aciozj]{2,}\b") || Regex.IsMatch(nameString, @".*?\bMc"))
{
nameString = UpdateMac(nameString);
}
return nameString;
}
/// <summary>
/// Updates irish Mac & Mc.
/// </summary>
/// <param name="nameString"></param>
/// <returns></returns>
private static string UpdateMac(string nameString)
{
MatchCollection matches = Regex.Matches(nameString, @"\b(Ma?c)([A-Za-z]+)");
if(matches.Count == 1 && matches[0].Groups.Count == 3)
{
string replacement = matches[0].Groups[1].Value;
replacement += matches[0].Groups[2].Value.Substring(0, 1).ToUpper();
replacement += matches[0].Groups[2].Value.Substring(1);
nameString = nameString.Replace(matches[0].Groups[0].Value, replacement);
// Now fix "Mac" exceptions
foreach (var exception in _exceptions.Keys)
{
nameString = Regex.Replace(nameString, exception, _exceptions[exception]);
}
}
return nameString;
}
/// <summary>
/// Fix roman numeral names.
/// </summary>
/// <param name="nameString"></param>
/// <returns></returns>
private static string UpdateRoman(string nameString)
{
MatchCollection matches = Regex.Matches(nameString, _romanRegex);
if (matches.Count > 1)
{
foreach(Match match in matches)
{
if(!string.IsNullOrEmpty(match.Value))
{
nameString = Regex.Replace(nameString, match.Value, x => x.ToString().ToUpper());
}
}
}
return nameString;
}
/// <summary>
/// Fix Spanish conjunctions.
/// </summary>
/// <param name="nameString"></param>
/// <returns></returns>
private static string FixConjunction(string nameString)
{
foreach (var conjunction in _conjunctions)
{
nameString = Regex.Replace(nameString, @"\b" + conjunction + @"\b", x => x.ToString().ToLower());
}
return nameString;
}
}
Usage
string name_cased = CIQNameCase.NameCase("McCarthy");
This is my test method, everything seems to pass OK:
[TestMethod]
public void Test_NameCase_1()
{
string[] names = {
"Keith", "Yuri's", "Leigh-Williams", "McCarthy",
// Mac exceptions
"Machin", "Machlin", "Machar",
"Mackle", "Macklin", "Mackie",
"Macquarie", "Machado", "Macevicius",
"Maciulis", "Macias", "MacMurdo",
// General
"O'Callaghan", "St. John", "von Streit",
"van Dyke", "Van", "ap Llwyd Dafydd",
"al Fahd", "Al",
"el Grecco",
"ben Gurion", "Ben",
"da Vinci",
"di Caprio", "du Pont", "de Legate",
"del Crond", "der Sind", "van der Post", "van den Thillart",
"von Trapp", "la Poisson", "le Figaro",
"Mack Knife", "Dougal MacDonald",
"Ruiz y Picasso", "Dato e Iradier", "Mas i Gavarró",
// Roman numerals
"Henry VIII", "Louis III", "Louis XIV",
"Charles II", "Fred XLIX", "Yusof bin Ishak",
};
foreach(string name in names)
{
string name_upper = name.ToUpper();
string name_cased = CIQNameCase.NameCase(name_upper);
Console.WriteLine(string.Format("name: {0} -> {1} -> {2}", name, name_upper, name_cased));
Assert.IsTrue(name == name_cased);
}
}
I wrote this today to implement in an app I'm working on. I think this code is pretty self-explanatory with comments. It's not 100% accurate in all cases but it will handle most Western names easily.
Examples:
mary-jane => Mary-Jane
o'brien => O'Brien
Joël VON WINTEREGG => Joël von Winteregg
jose de la acosta => Jose de la Acosta
The code is extensible in that you may add any string value to the arrays at the top to suit your needs. Please study it and add any special feature that may be required.
function name_title_case($str)
{
    // name parts that should be lowercase in most cases
    $ok_to_be_lower = array('av','af','da','dal','de','del','der','di','la','le','van','den','vel','von');
    // name parts that should be lower even if at the beginning of a name
    $always_lower = array('van', 'der');
    // Create an array from the parts of the string passed in
    $parts = explode(" ", mb_strtolower($str));
    $c13n = array();
    // Loop through and cap-or-dont-cap. Iterating by position (instead of
    // using each part as an array key) keeps repeated parts, such as the
    // two "de" in "maria de jesus de la cruz".
    foreach ($parts as $i => $part)
    {
        if (!in_array($part, $ok_to_be_lower))
        {
            // ucwords() the part and also take into account apostrophes and hyphens like this:
            // O'brien -> O'Brien || mary-kaye -> Mary-Kaye
            $part = str_replace('- ','-',ucwords(str_replace('-','- ', $part)));
            $c13n[] = str_replace('\' ', '\'', ucwords(str_replace('\'', '\' ', $part)));
        }
        else if ($i === 0 && !in_array($part, $always_lower))
        {
            // If the first part of the string is ok_to_be_lower, cap it anyway
            $c13n[] = ucfirst($part);
        }
        else
        {
            $c13n[] = $part;
        }
    }
    $titleized = implode(' ', $c13n);
    return trim($titleized);
}
What programming language do you use? Many languages allow callback functions for regular expression matches. These can be used to propercase the match easily. The regular expression is quite simple: you just have to match all word characters, like so:
/\w+/
Alternatively, you can capture the first character as a separate group:
/(\w)(\w*)/
Now you can access the first character and successive characters in the match separately. The callback function can then simply return a concatenation of the hits. In pseudo Python (I don't actually know Python):
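def make_proper(match):
    # match[1] is the first character, match[2] the rest of the word
    return match[1].upper() + match[2]
# usage: re.sub(r"(\w)(\w*)", make_proper, text), after "import re"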
Incidentally, this would also handle the case of “O'Reilly” because “O” and “Reilly” would be matched separately and both propercased. There are however other special cases that are not handled well by the algorithm, e.g. “McDonald's” or generally any word containing an apostrophe. The algorithm would turn “mcdonald's” into “Mcdonald'S”. Special handling for the apostrophe could be implemented, but that would interfere with the first case. Finding a theoretically perfect solution isn't possible. In practice, it might help to consider the length of the part after the apostrophe.
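The length heuristic mentioned above could look roughly like this. It is only a sketch in Python, not a complete solution: the function name `proper_case` and the threshold `MAX_LOWER_SUFFIX_LEN` are assumptions made up for this example, and a threshold of 2 is a guess that will misfire on some names.

```python
import re

# Assumed threshold: a run of letters that directly follows an apostrophe
# stays lowercase when it is at most this long ("'s", "'ll"); longer runs
# such as "Reilly" are capitalised.
MAX_LOWER_SUFFIX_LEN = 2

def proper_case(text: str) -> str:
    def fix(match: "re.Match[str]") -> str:
        word = match.group(0)
        start = match.start()
        after_apostrophe = start > 0 and text[start - 1] == "'"
        if after_apostrophe and len(word) <= MAX_LOWER_SUFFIX_LEN:
            return word.lower()
        return word[0].upper() + word[1:].lower()

    # [^\W\d_]+ matches runs of letters only (no digits or underscores).
    return re.sub(r"[^\W\d_]+", fix, text)
```

With this, `proper_case("o'reilly")` gives "O'Reilly" and `proper_case("mcdonald's")` gives "Mcdonald's", so the possessive is left alone, but the "Mc" prefix would still need separate handling.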
Here's a perhaps naive C# implementation:
public class ProperCaseHelper {
    public string ToProperCase(string input) {
        string ret = string.Empty;
        var words = input.Split(' ');
        for (int i = 0; i < words.Length; ++i) {
            ret += wordToProperCase(words[i]);
            if (i < words.Length - 1) ret += " ";
        }
        return ret;
    }
    private string wordToProperCase(string word) {
        if (string.IsNullOrEmpty(word)) return word;
        // Standard case
        string ret = capitaliseFirstLetter(word);
        // Special cases:
        ret = properSuffix(ret, "'");
        ret = properSuffix(ret, ".");
        ret = properSuffix(ret, "Mc");
        ret = properSuffix(ret, "Mac");
        return ret;
    }
    private string properSuffix(string word, string prefix) {
        if (string.IsNullOrEmpty(word)) return word;
        string lowerWord = word.ToLower(), lowerPrefix = prefix.ToLower();
        if (!lowerWord.Contains(lowerPrefix)) return word;
        int index = lowerWord.IndexOf(lowerPrefix);
        // If the search string is at the end of the word ignore.
        if (index + prefix.Length == word.Length) return word;
        return word.Substring(0, index) + prefix +
            capitaliseFirstLetter(word.Substring(index + prefix.Length));
    }
    private string capitaliseFirstLetter(string word) {
        return char.ToUpper(word[0]) + word.Substring(1).ToLower();
    }
}
I know this thread has been open for a while, but as I was doing research for this problem I came across this nifty site, which allows you to paste in names to be capitalized quite quickly: https://dialect.ca/code/name-case/. I wanted to include it here for reference for others doing similar research/projects.
They publish their PHP implementation of the algorithm at this link: https://dialect.ca/code/name-case/name_case.phps
A preliminary test and reading of their code suggests they have been quite thorough.
A simple way to capitalise the first letter of each word (separated by a space):
$result = '';
$words = explode(" ", $string);
for ($i=0; $i<count($words); $i++) {
    $s = strtolower($words[$i]);
    $s = substr_replace($s, strtoupper(substr($s, 0, 1)), 0, 1);
    $result .= "$s ";
}
$string = trim($result);
In terms of catching the "O'REILLY" example you gave:
Splitting the string on both spaces and ' would not work, as it would capitalise any letter that appeared after an apostrophe, i.e. the s in Fred's.
So I would probably try something like:
$result = '';
$words = explode(" ", $string);
for ($i=0; $i<count($words); $i++) {
    $s = strtolower($words[$i]);
    if (substr($s, 0, 2) === "o'") {
        // upper-case the "o" and the letter after the apostrophe
        $s = substr_replace($s, strtoupper(substr($s, 0, 3)), 0, 3);
    } else {
        $s = substr_replace($s, strtoupper(substr($s, 0, 1)), 0, 1);
    }
    $result .= "$s ";
}
$string = trim($result);
This should catch O'Reilly, O'Clock, O'Donnell, etc. Hope it helps.
Please note this code is untested.
Kronoz, thank you. I found that in your function the line:
if (!lowerWord.Contains(lowerPrefix)) return word;
should read
if (!lowerWord.StartsWith(lowerPrefix)) return word;
so that "información" is not changed to "InforMacIón".
best,
Enrique
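To illustrate the difference, here is a small Python sketch of a prefix-only check. The function name `fix_prefix` is made up for this example and is not part of the C# class above. Note that a prefix-only test also means the apostrophe and full-stop cases in `wordToProperCase` would no longer fire for names like "O'Brien" (the apostrophe is not at the start), so those would probably still need a contains-style search.

```python
def fix_prefix(word: str, prefix: str) -> str:
    """Capitalise the letter after `prefix`, but only if `word` starts with it."""
    if not word.lower().startswith(prefix.lower()):
        return word  # prefix occurs elsewhere or not at all: leave untouched
    rest = word[len(prefix):]
    if not rest:
        return word  # the word is just the prefix itself
    return prefix + rest[0].upper() + rest[1:].lower()
```

Here `fix_prefix("Información", "Mac")` leaves the word unchanged, while `fix_prefix("Macdonald", "Mac")` gives "MacDonald".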
I use this as the TextChanged event handler of text boxes. It supports entry of "McDonald" (with allowCapital left at True, capitals typed by the user are kept).
Public Shared Function DoProperCaseConvert(ByVal str As String, Optional ByVal allowCapital As Boolean = True) As String
    Dim strCon As String = ""
    Dim wordbreak As String = " ,.1234567890;/\-()#$%^&*€!~+=#"
    Dim nextShouldBeCapital As Boolean = True
    'Improve to recognize all caps input
    'If str.Equals(str.ToUpper) Then
    '    str = str.ToLower
    'End If
    For Each s As Char In str.ToCharArray
        If allowCapital Then
            strCon = strCon & If(nextShouldBeCapital, s.ToString.ToUpper, s)
        Else
            strCon = strCon & If(nextShouldBeCapital, s.ToString.ToUpper, s.ToString.ToLower)
        End If
        If wordbreak.Contains(s.ToString) Then
            nextShouldBeCapital = True
        Else
            nextShouldBeCapital = False
        End If
    Next
    Return strCon
End Function
A lot of good answers here. Mine is pretty simple and only takes into account the names we have in our organization. You can expand it as you wish. This is not a perfect solution and will change vancouver to VanCouver, which is wrong. So tweak it if you use it.
Here was my solution in C#. This hard-codes the names into the program but with a little work you could keep a text file outside of the program and read in the name exceptions (i.e. Van, Mc, Mac) and loop through them.
// requires: using System.Text.RegularExpressions;
public static String toProperName(String name)
{
    if (name != null)
    {
        if (name.Length >= 2 && name.ToLower().Substring(0, 2) == "mc") // Changes mcdonald to "McDonald"
            return "Mc" + Regex.Replace(name.ToLower().Substring(2), @"\b[a-z]", m => m.Value.ToUpper());
        if (name.Length >= 3 && name.ToLower().Substring(0, 3) == "van") // Changes vanwinkle to "VanWinkle"
            return "Van" + Regex.Replace(name.ToLower().Substring(3), @"\b[a-z]", m => m.Value.ToUpper());
        // Changes to title case but also fixes apostrophes like O'HARE or o'hare to O'Hare
        return Regex.Replace(name.ToLower(), @"\b[a-z]", m => m.Value.ToUpper());
    }
    return "";
}
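The idea of keeping the exceptions outside the program and looping through them might look something like this. It is a rough Python sketch rather than a port of the C# above; the names `load_prefixes`, `capitalise_words` and `to_proper_name`, and the file name in the final comment, are assumptions for illustration only.

```python
import re
from typing import Iterable, List

def load_prefixes(lines: Iterable[str]) -> List[str]:
    """Read one prefix per line (e.g. "Mc", "Van"), skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]

def capitalise_words(text: str) -> str:
    # Upper-case the first ASCII letter after every word boundary, which
    # mirrors the \b[a-z] replacement used in the C# version.
    return re.sub(r"\b[a-z]", lambda m: m.group(0).upper(), text.lower())

def to_proper_name(name: str, prefixes: Iterable[str]) -> str:
    if not name:
        return ""
    lowered = name.lower()
    for prefix in prefixes:
        if lowered.startswith(prefix.lower()):
            # Keep the prefix as spelled in the list, title-case the rest.
            return prefix + capitalise_words(lowered[len(prefix):])
    return capitalise_words(lowered)

# Hypothetical usage with an external file, one prefix per line:
#   with open("name_prefixes.txt") as f:
#       prefixes = load_prefixes(f)
```

Like the original, this will still turn "vancouver" into "VanCouver" if "Van" is in the list, so the list has to fit the names you actually expect.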
You do not mention which language you would like the solution in so here is some pseudo code.
Loop through each character
    If the previous character was a letter
        Make the character lower case
    Otherwise
        Make the character upper case
End loop
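As one possible rendering of that loop, here is a short Python sketch (the function name `proper_case_by_char` is just a placeholder for this example):

```python
def proper_case_by_char(text: str) -> str:
    result = []
    prev_is_letter = False  # nothing precedes the first character
    for ch in text:
        # Lower-case inside a word, upper-case at the start of one.
        result.append(ch.lower() if prev_is_letter else ch.upper())
        prev_is_letter = ch.isalpha()
    return "".join(result)
```

This turns "o'REILLY" into "O'Reilly" and "mary-jane" into "Mary-Jane", but it also capitalises the letter after every apostrophe, so "fred's" becomes "Fred'S".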

Resources