How to convert Chinese characters to Pinyin [closed] - sorting

To sort Chinese text, I want to convert Chinese characters to Pinyin, keeping each character's syllable separate while grouping successive characters into words.
Can you help me with this task by providing the logic or source code for it?
Please let me know if an open-source tool or library for this already exists.

Short answer: you don't.
Long answer: There is no one-to-one mapping for 汉字 to 汉语拼音. Just some quick examples:
把 can be "ba" in the third tone or fourth tone.
了 can be "le" toneless or "liao" third tone.
乐 can be "le" or "yue", both in the fourth tone.
落 can be "luo", "la" or "lao", all in the fourth tone.
And so on. I have a beginners' book on this topic that has 207 examples. I stress that this is a beginners' book and is by no means complete. Each one has a page or two of examples of use and conditions under which you choose the appropriate pronunciation. It is not something that could be easily programmed (if at all).
And this doesn't even address the other slippery thing you want to deal with: the separation of characters into grouped words. The very notion of a word is a bit slippery in Chinese. There are two terms that correspond roughly to "word" in Chinese, for example: 字 and 词. The first is the character; the second is a group of characters that are put together into one concept. (I frequently get asked by Chinese speakers how many "words" I can read when they really mean "characters".) While in some cases the distinction is clear (the 词 "乌鸦", for example, is "crow": the two 字 must be together to express the idea properly, and it would be incorrect to translate it as "black crow"), in others it is not so clear. What does "你好" translate to? Is it one word meaning, idiomatically, "hello"? Or is it two words translating literally to "you good"? Each of the characters involved stands alone or in groups with other characters, but together they mean something entirely different from their individual meanings. Given this, how, precisely, do you plan to group the 汉语拼音 transliterations (which are difficult to impossible to get right in the first place!) into "words"?

While @JUST MY correct OPINION's answer addresses some of the difficulties of converting characters into pinyin, it is not an impossible problem to solve.
I have written a library (pinyinify) that solves this task with decent accuracy. Even though there is not a one-to-one mapping between characters and pinyin, my library can usually decide which pronunciation is correct. For example, "我受不了了" correctly converts to "wǒ shòubùliǎo le", with two different pronunciations of 了.
My approach to solving the problem is pretty simple:
First segment the text into words. For example, 我喜欢旅游 would be divided into three words: 我 喜欢 旅游. This is also not a simple process, but there are many libraries for it. jieba is one of the more popular libraries for this purpose.
Use a dictionary to convert the words into pinyin.
If the word is not in the dictionary, fall back to converting the individual characters to pinyin using their most common pronunciation.
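The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual implementation of pinyinify: the two tiny dictionaries are made-up sample data, and the greedy longest-match `segment` function is a simple stand-in for a real segmenter such as jieba.

```python
# Toy data for illustration only; a real system needs a full lexicon.
WORD_PINYIN = {
    "受不了": "shòubùliǎo",
    "银行": "yínháng",
    "自行车": "zìxíngchē",
}
# Assumed "most common reading" for single characters.
CHAR_PINYIN = {
    "我": "wǒ",
    "了": "le",
    "行": "xíng",
}


def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation (a stand-in for a real segmenter)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def to_pinyin(text):
    parts = []
    for word in segment(text, WORD_PINYIN):
        if word in WORD_PINYIN:
            # Step 2: whole-word dictionary lookup.
            parts.append(WORD_PINYIN[word])
        else:
            # Step 3: per-character fallback; unknown characters pass through.
            parts.append("".join(CHAR_PINYIN.get(ch, ch) for ch in word))
    return " ".join(parts)


print(to_pinyin("我受不了了"))  # wǒ shòubùliǎo le
```

The word-level entry is what lets the two occurrences of 了 come out differently: the first is covered by the dictionary word 受不了, the second falls back to the single-character reading.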

Core Foundation provides a function to do the conversion:
CFMutableStringRef string = CFStringCreateMutableCopy(NULL, 0, CFSTR("中文"));
CFStringTransform(string, NULL, kCFStringTransformMandarinLatin, NO);
CFStringTransform(string, NULL, kCFStringTransformStripDiacritics, NO);
NSLog(@"%@", string);
The output is
zhong wen

The following C# code converts Chinese characters covered by the GB2312 encoding (the commonly used Simplified Chinese characters) to Pinyin, for example "今天天气不错" to "JinTianTianQiBuCuo".
A Chinese character does not always map to a single Pinyin reading; it depends on the context. For example, the "行" in "自行车" (bike) is pronounced "Xing", but in "银行" (bank) it is pronounced "Hang". If that matters for you, you will need a more sophisticated solution.
Sorry for my poor English. I hope this helps a little.
using System.Text.RegularExpressions; // needed for Regex below

public class ChineseToPinYin
{
private static int[] pyValue = new int[]
{
-20319,-20317,-20304,-20295,-20292,-20283,-20265,-20257,-20242,-20230,-20051,-20036,
-20032,-20026,-20002,-19990,-19986,-19982,-19976,-19805,-19784,-19775,-19774,-19763,
-19756,-19751,-19746,-19741,-19739,-19728,-19725,-19715,-19540,-19531,-19525,-19515,
-19500,-19484,-19479,-19467,-19289,-19288,-19281,-19275,-19270,-19263,-19261,-19249,
-19243,-19242,-19238,-19235,-19227,-19224,-19218,-19212,-19038,-19023,-19018,-19006,
-19003,-18996,-18977,-18961,-18952,-18783,-18774,-18773,-18763,-18756,-18741,-18735,
-18731,-18722,-18710,-18697,-18696,-18526,-18518,-18501,-18490,-18478,-18463,-18448,
-18447,-18446,-18239,-18237,-18231,-18220,-18211,-18201,-18184,-18183, -18181,-18012,
-17997,-17988,-17970,-17964,-17961,-17950,-17947,-17931,-17928,-17922,-17759,-17752,
-17733,-17730,-17721,-17703,-17701,-17697,-17692,-17683,-17676,-17496,-17487,-17482,
-17468,-17454,-17433,-17427,-17417,-17202,-17185,-16983,-16970,-16942,-16915,-16733,
-16708,-16706,-16689,-16664,-16657,-16647,-16474,-16470,-16465,-16459,-16452,-16448,
-16433,-16429,-16427,-16423,-16419,-16412,-16407,-16403,-16401,-16393,-16220,-16216,
-16212,-16205,-16202,-16187,-16180,-16171,-16169,-16158,-16155,-15959,-15958,-15944,
-15933,-15920,-15915,-15903,-15889,-15878,-15707,-15701,-15681,-15667,-15661,-15659,
-15652,-15640,-15631,-15625,-15454,-15448,-15436,-15435,-15419,-15416,-15408,-15394,
-15385,-15377,-15375,-15369,-15363,-15362,-15183,-15180,-15165,-15158,-15153,-15150,
-15149,-15144,-15143,-15141,-15140,-15139,-15128,-15121,-15119,-15117,-15110,-15109,
-14941,-14937,-14933,-14930,-14929,-14928,-14926,-14922,-14921,-14914,-14908,-14902,
-14894,-14889,-14882,-14873,-14871,-14857,-14678,-14674,-14670,-14668,-14663,-14654,
-14645,-14630,-14594,-14429,-14407,-14399,-14384,-14379,-14368,-14355,-14353,-14345,
-14170,-14159,-14151,-14149,-14145,-14140,-14137,-14135,-14125,-14123,-14122,-14112,
-14109,-14099,-14097,-14094,-14092,-14090,-14087,-14083,-13917,-13914,-13910,-13907,
-13906,-13905,-13896,-13894,-13878,-13870,-13859,-13847,-13831,-13658,-13611,-13601,
-13406,-13404,-13400,-13398,-13395,-13391,-13387,-13383,-13367,-13359,-13356,-13343,
-13340,-13329,-13326,-13318,-13147,-13138,-13120,-13107,-13096,-13095,-13091,-13076,
-13068,-13063,-13060,-12888,-12875,-12871,-12860,-12858,-12852,-12849,-12838,-12831,
-12829,-12812,-12802,-12607,-12597,-12594,-12585,-12556,-12359,-12346,-12320,-12300,
-12120,-12099,-12089,-12074,-12067,-12058,-12039,-11867,-11861,-11847,-11831,-11798,
-11781,-11604,-11589,-11536,-11358,-11340,-11339,-11324,-11303,-11097,-11077,-11067,
-11055,-11052,-11045,-11041,-11038,-11024,-11020,-11019,-11018,-11014,-10838,-10832,
-10815,-10800,-10790,-10780,-10764,-10587,-10544,-10533,-10519,-10331,-10329,-10328,
-10322,-10315,-10309,-10307,-10296,-10281,-10274,-10270,-10262,-10260,-10256,-10254
};
private static string[] pyName = new string[]
{
"A","Ai","An","Ang","Ao","Ba","Bai","Ban","Bang","Bao","Bei","Ben",
"Beng","Bi","Bian","Biao","Bie","Bin","Bing","Bo","Bu","Ca","Cai","Can",
"Cang","Cao","Ce","Ceng","Cha","Chai","Chan","Chang","Chao","Che","Chen","Cheng",
"Chi","Chong","Chou","Chu","Chuai","Chuan","Chuang","Chui","Chun","Chuo","Ci","Cong",
"Cou","Cu","Cuan","Cui","Cun","Cuo","Da","Dai","Dan","Dang","Dao","De",
"Deng","Di","Dian","Diao","Die","Ding","Diu","Dong","Dou","Du","Duan","Dui",
"Dun","Duo","E","En","Er","Fa","Fan","Fang","Fei","Fen","Feng","Fo",
"Fou","Fu","Ga","Gai","Gan","Gang","Gao","Ge","Gei","Gen","Geng","Gong",
"Gou","Gu","Gua","Guai","Guan","Guang","Gui","Gun","Guo","Ha","Hai","Han",
"Hang","Hao","He","Hei","Hen","Heng","Hong","Hou","Hu","Hua","Huai","Huan",
"Huang","Hui","Hun","Huo","Ji","Jia","Jian","Jiang","Jiao","Jie","Jin","Jing",
"Jiong","Jiu","Ju","Juan","Jue","Jun","Ka","Kai","Kan","Kang","Kao","Ke",
"Ken","Keng","Kong","Kou","Ku","Kua","Kuai","Kuan","Kuang","Kui","Kun","Kuo",
"La","Lai","Lan","Lang","Lao","Le","Lei","Leng","Li","Lia","Lian","Liang",
"Liao","Lie","Lin","Ling","Liu","Long","Lou","Lu","Lv","Luan","Lue","Lun",
"Luo","Ma","Mai","Man","Mang","Mao","Me","Mei","Men","Meng","Mi","Mian",
"Miao","Mie","Min","Ming","Miu","Mo","Mou","Mu","Na","Nai","Nan","Nang",
"Nao","Ne","Nei","Nen","Neng","Ni","Nian","Niang","Niao","Nie","Nin","Ning",
"Niu","Nong","Nu","Nv","Nuan","Nue","Nuo","O","Ou","Pa","Pai","Pan",
"Pang","Pao","Pei","Pen","Peng","Pi","Pian","Piao","Pie","Pin","Ping","Po",
"Pu","Qi","Qia","Qian","Qiang","Qiao","Qie","Qin","Qing","Qiong","Qiu","Qu",
"Quan","Que","Qun","Ran","Rang","Rao","Re","Ren","Reng","Ri","Rong","Rou",
"Ru","Ruan","Rui","Run","Ruo","Sa","Sai","San","Sang","Sao","Se","Sen",
"Seng","Sha","Shai","Shan","Shang","Shao","She","Shen","Sheng","Shi","Shou","Shu",
"Shua","Shuai","Shuan","Shuang","Shui","Shun","Shuo","Si","Song","Sou","Su","Suan",
"Sui","Sun","Suo","Ta","Tai","Tan","Tang","Tao","Te","Teng","Ti","Tian",
"Tiao","Tie","Ting","Tong","Tou","Tu","Tuan","Tui","Tun","Tuo","Wa","Wai",
"Wan","Wang","Wei","Wen","Weng","Wo","Wu","Xi","Xia","Xian","Xiang","Xiao",
"Xie","Xin","Xing","Xiong","Xiu","Xu","Xuan","Xue","Xun","Ya","Yan","Yang",
"Yao","Ye","Yi","Yin","Ying","Yo","Yong","You","Yu","Yuan","Yue","Yun",
"Za", "Zai","Zan","Zang","Zao","Ze","Zei","Zen","Zeng","Zha","Zhai","Zhan",
"Zhang","Zhao","Zhe","Zhen","Zheng","Zhi","Zhong","Zhou","Zhu","Zhua","Zhuai","Zhuan",
"Zhuang","Zhui","Zhun","Zhuo","Zi","Zong","Zou","Zu","Zuan","Zui","Zun","Zuo"
};
/// <summary>
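/// Converts Chinese characters to Pinyin (full spelling)
/// </summary>
/// <param name="hzString">String of Chinese characters</param>
/// <returns>The converted Pinyin (full spelling) string</returns>
public static string Convert(string hzString)
{
// Matches a single Chinese character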
Regex regex = new Regex("^[\u4e00-\u9fa5]$");
byte[] array = new byte[2];
string pyString = "";
int chrAsc = 0;
int i1 = 0;
int i2 = 0;
char[] noWChar = hzString.ToCharArray();
for (int j = 0; j < noWChar.Length; j++)
{
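// Chinese character
if (regex.IsMatch(noWChar[j].ToString()))
{
// Use GB2312 explicitly: Encoding.Default depends on the machine's code page.
// (On .NET Core and later, the System.Text.Encoding.CodePages provider must be registered first.)
array = System.Text.Encoding.GetEncoding("gb2312").GetBytes(noWChar[j].ToString());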
i1 = (short)(array[0]);
i2 = (short)(array[1]);
chrAsc = i1 * 256 + i2 - 65536;
if (chrAsc > 0 && chrAsc < 160)
{
pyString += noWChar[j];
}
else
{
// Special-case corrections
if (chrAsc == -9254) // correct the reading of "圳"
pyString += "Zhen";
else
{
for (int i = (pyValue.Length - 1); i >= 0; i--)
{
if (pyValue[i] <= chrAsc)
{
pyString += pyName[i];
break;
}
}
}
}
}
// Non-Chinese character
else
{
pyString += noWChar[j].ToString();
}
}
return pyString;
}
}

You can use the following method:
from __future__ import unicode_literals
from pypinyin import lazy_pinyin
hanzi_list = ['如何', '将', '汉字','转为', '拼音']
pinyin_list = [''.join(lazy_pinyin(_)) for _ in hanzi_list]
Output:
['ruhe', 'jiang', 'hanzi', 'zhuanwei', 'pinyin']

I had this problem and found a solution in PHP (it could be cleaner, I suppose, but it works). I had some trouble because the data file mentioned in this thread is keyed by hexadecimal Unicode code points.
1) Import the data from ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/data/Uni2Pinyin.gz (thanks pierr) to your database or whatever
2) Import your data in an array as $pinyinArray[$hexaUnicode] = $pinyin;
3) Use this code:
/*
* Decimal representation of $c
* function found there: http://www.cantonese.sheik.co.uk/phorum/read.php?2,19594
*/
function uniord($c)
{
$ud = 0;
// $c[0] replaces the old $c{0} offset syntax, which PHP 8 no longer accepts
if (ord($c[0])>=0 && ord($c[0])<=127)
$ud = ord($c[0]);
if (ord($c[0])>=192 && ord($c[0])<=223)
$ud = (ord($c[0])-192)*64 + (ord($c[1])-128);
if (ord($c[0])>=224 && ord($c[0])<=239)
$ud = (ord($c[0])-224)*4096 + (ord($c[1])-128)*64 + (ord($c[2])-128);
if (ord($c[0])>=240 && ord($c[0])<=247)
$ud = (ord($c[0])-240)*262144 + (ord($c[1])-128)*4096 + (ord($c[2])-128)*64 + (ord($c[3])-128);
if (ord($c[0])>=248 && ord($c[0])<=251)
$ud = (ord($c[0])-248)*16777216 + (ord($c[1])-128)*262144 + (ord($c[2])-128)*4096 + (ord($c[3])-128)*64 + (ord($c[4])-128);
if (ord($c[0])>=252 && ord($c[0])<=253)
$ud = (ord($c[0])-252)*1073741824 + (ord($c[1])-128)*16777216 + (ord($c[2])-128)*262144 + (ord($c[3])-128)*4096 + (ord($c[4])-128)*64 + (ord($c[5])-128);
if (ord($c[0])>=254 && ord($c[0])<=255) //error
$ud = false;
return $ud;
}
/*
* Translate a single Chinese character in $string to its hexadecimal Unicode code point
*/
function chineseToHexaUnicode($string) {
return strtoupper(dechex(uniord($string)));
}
/*
* Convert each character of $string to Pinyin using the lookup array
*/
function convertChineseToPinyin($string,$pinyinArray) {
$pinyinValue = '';
for ($i = 0; $i < mb_strlen($string);$i++)
$pinyinValue.=$pinyinArray[chineseToHexaUnicode(mb_substr($string, $i, 1))];
return $pinyinValue;
}
$string = '龙江省五大';
echo convertChineseToPinyin($string,$pinyinArray);
Output: (long2)(jiang1)(sheng3,xing3)(wu3)(da4,dai4)
Of course, $pinyinArray is your array of data (hexaUnicode => pinyin).
Hope it will help someone.
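For comparison, the same table-lookup idea is much shorter in Python, because `ord` already returns the code point of a character. This is a rough sketch only: the five sample entries are taken from the output above, whereas a real table would be loaded from the Uni2Pinyin file, and the function names are made up for the example.

```python
# Hypothetical sample entries; a real table would be loaded from Uni2Pinyin.
# The readings are the ones shown in the output above.
SAMPLE_READINGS = {
    "龙": "long2",
    "江": "jiang1",
    "省": "sheng3,xing3",
    "五": "wu3",
    "大": "da4,dai4",
}


def hex_code_point(ch):
    """Upper-case hexadecimal Unicode code point of a single character."""
    return format(ord(ch), "X")


# Key the table by hex code point, as in the PHP version.
PINYIN_TABLE = {hex_code_point(ch): reading for ch, reading in SAMPLE_READINGS.items()}


def chinese_to_pinyin(text, table):
    parts = []
    for ch in text:
        reading = table.get(hex_code_point(ch))
        # Characters missing from the table are passed through unchanged.
        parts.append("(" + reading + ")" if reading is not None else ch)
    return "".join(parts)


print(chinese_to_pinyin("龙江省五大", PINYIN_TABLE))
# (long2)(jiang1)(sheng3,xing3)(wu3)(da4,dai4)
```

Note that every listed reading is returned for a character; choosing between them (for example sheng3 versus xing3) still needs context.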

If you use Visual Studio, this might be an option:
Microsoft.International.Converters.PinYinConverter
How to install:
First, download the Visual Studio International Pack 2.0 from the official download page. Once the download is complete, run the VSIPSetup.msi installer. On an x86 operating system the default installation directory is C:\Program Files\Microsoft Visual Studio International Feature Pack 2.0.
After installation, you need to add two references in Visual Studio:
C:\Program Files\Microsoft Visual Studio International Pack\Simplified Chinese Pin-Yin Conversion Library (Pinyin)
and
C:\Program Files\Microsoft Visual Studio International Pack\Traditional Chinese to Simplified Chinese Conversion Library and Add-In Tool (Traditional and Simplified Huzhuan to)
How to use:
public static string GetPinyin(string str)
{
string r = string.Empty;
foreach (char obj in str)
{
try
{
ChineseChar chineseChar = new ChineseChar(obj);
string t = chineseChar.Pinyins[0].ToString();
r += t.Substring(0, t.Length - 1);
}
catch
{
r += obj.ToString();
}
}
return r;
}
Source:
http://www.programering.com/a/MzM3cTMwATA.html


AWS lambda function to speak number as digit in alexa

I have tried to use say-as with interpret-as to make Alexa speak a number as digits.
Example: 9822 must not be read as a word but as '9, 8, 2, 2'.
One of the two ways I have tried is as follows:
this.emit(':tell',"Hi "+clientname+" your "+theIntentConfirmationStatus+" ticket is sent to "+ "<say-as interpret-as='digits'>" + clientno + "</say-as>",'backup');
The other one is this:
this.response.speak("Hi "+clientname+" your "+theIntentConfirmationStatus+" ticket is sent to "+ "<say-as interpret-as='digits'>" + clientno + "</say-as>");
Neither works in my skill, although both work in a separate, fresh function.
Actually your code SHOULD work.
Maybe you can try in test simulator and send us the code your script produces? Or the logs?
I've tried the following:
<speak>
1. The numbers are: <say-as interpret-as="digits">5498</say-as>.
2. The numbers are: <say-as interpret-as="spell-out">5498</say-as>.
3. The numbers are: <say-as interpret-as="characters">5498</say-as>.
4. The numbers are: <prosody rate="x-slow"><say-as interpret-as="digits">5498</say-as></prosody>.
5. The number is: 5498.
</speak>
Digits, Spell-out and Characters all have the effect you want.
If you want Alexa to say it extra slowly, use the prosody tag as in #4.
Try using examples #2 or #3, maybe this works out?
Otherwise the example from Amod will work too.
You can split the number into individual digits using the sample function below (please test it with your possible inputs; it has not been tested for all inputs). You can search for similar functions on Stack Overflow.
function getNumber(tablenumber) {
// Split the number's string form into individual characters
var digits = ("" + tablenumber).split("");
var tmp = " ";
for (var i = 0; i < digits.length; i++) {
// Append each digit followed by a short SSML pause
tmp = tmp + digits[i] + ", <break time=\"0.4s\"/> ";
}
return tmp;
}
In your main function... call this
var finalresult = getNumber(clientno);
this.emit(':tell',"Hi "+clientname+" your "+theIntentConfirmationStatus+" ticket is sent to "+ finalresult ,'backup');
Edited: Yep, nightflash's answer is great.
You could also break the numbers up yourself if you need other formatting, such as emphasizing particular digits, add pauses, etc. You would need to use your Lambda code to convert the numeric string to multiple digits separated by spaces and any other formatting you need.
Here's an example based on the answers in this post:
var inputNumber = 12354987;
var output = '';
var sNumber = inputNumber.toString();
for (var i = 0, len = sNumber.length; i < len; i += 1) {
// just adding spaces here, but could be SSML attributes, etc.
output = output + sNumber.charAt(i) + ' ';
}
console.log(output);
This code could be refactored and done many other ways, but I think this is about the easiest to understand.
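If the Lambda function is written in Python rather than Node.js, the same two approaches from this thread can be sketched as follows. The function names are made up for the example and the 0.4s pause is simply the value used earlier; only the SSML tags themselves come from the answers above.

```python
def say_as_digits(number):
    """Wrap the number in a say-as tag so it is read digit by digit."""
    return f'<say-as interpret-as="digits">{number}</say-as>'


def digits_with_pauses(number, pause="0.4s"):
    """Split the number into digits and put an SSML break between them."""
    digits = [ch for ch in str(number) if ch.isdigit()]
    separator = f' <break time="{pause}"/> '
    return separator.join(digits)


print(say_as_digits(9822))
# <say-as interpret-as="digits">9822</say-as>
print(digits_with_pauses(9822))
# 9 <break time="0.4s"/> 8 <break time="0.4s"/> 2 <break time="0.4s"/> 2
```

Either string can then be concatenated into the speech text in place of the raw number.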

How to create a hack proof unique code [closed]

I am creating a bunch of unique codes in order to run a promotional campaign.
The campaign will run for a total of 20 million unique items. The codes will be valid for one year. I am currently looking for the best possible option.
I can use only 0-9 and A-Z in the code, so that limits me to 36 unique characters. The end user will need to key the unique code into the system to get offers. The unique code will not be tied to any user or transaction to begin with.
One way to generate unique codes is to create incremental numbers and then convert them to base 36. The problem with this is that it's easily hackable: users can start entering unique codes in incremental fashion and redeem offers not meant for them. I am thinking of introducing some kind of randomisation and need suggestions for it.
Note: the maximum length of the code is 8 characters.
Use a cryptographically strong random number generator to generate 40-bit numbers (i.e. sequences of 5-byte random arrays). Converting each number to base 36 yields a random code of at most eight characters, since 2^40 is less than 36^8; pad shorter results with leading zeros if you need a fixed length. Run an additional check on each code to make sure that there are no duplicates. Using a hash set on the converted strings will let you perform this task in a reasonable time.
Here is an example implementation in Java:
Set<String> codes = new HashSet<>();
SecureRandom rng = new SecureRandom();
byte[] data = new byte[5];
for (int i = 0 ; i != 100000 ; i++) {
rng.nextBytes(data);
long val = ((long)(data[0] & 0xFF))
| (((long)(data[1] & 0xFF)) << 8)
| (((long)(data[2] & 0xFF)) << 16)
| (((long)(data[3] & 0xFF)) << 24)
| (((long)(data[4] & 0xFF)) << 32);
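// Long.toString gives lower-case, variable-length text; upper-case it and left-pad with zeros to 8 characters
String s = String.format("%8s", Long.toString(val, 36).toUpperCase()).replace(' ', '0');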
codes.add(s);
}
System.out.println("Generated "+codes.size()+" codes.");
Use a Guid (C# code):
string code = Guid.NewGuid().ToString().Substring(0,8).ToUpperInvariant();
Since we have a hexadecimal representation, we get the digits and the characters A to F. That gives 16^8 possible codes, which is more than 4 billion, so with 20 million codes issued roughly one in every 214 possible codes is valid.
Guid.NewGuid().ToString() yields a string like "6b984c2f-5866-4745-ac34-d5088a56070f". Since the first group has a length of 8 characters we can just take the first 8 chars and convert them to upper case. The result looks like "6B984C2F".
Note that this can yield duplicate codes. We can avoid this like this:
var codes = new HashSet<string>();
while (codes.Count < 20000000) {
string code = Guid.NewGuid().ToString().Substring(0,8).ToUpperInvariant();
codes.Add(code);
}
The HashSet allows you to add an item more than once but always only keeps one of them. (Just as math sets.)
If you want to use the full range of possible values, the one-liner above is not enough. With the whole alphabet plus digits we get 36^8 = ~2.8 * 10^12 possible codes, so with 20 million codes issued only one in every 141,055 possible codes is valid. That's better but still not completely hack proof. You will need to limit the number of entry attempts, use a CAPTCHA, etc.
const string Base = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const int CodeLength = 8;
const int NumCodes = 20000000;
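// System.Random is predictable; RandomNumberGenerator (System.Security.Cryptography, .NET Core 3.0 or later) is much harder to guess.
var codes = new HashSet<string>();
var chars = new char[CodeLength];
while (codes.Count < NumCodes) {
for (int i = 0; i < CodeLength; i++) {
int pos = RandomNumberGenerator.GetInt32(Base.Length);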
chars[i] = Base[pos];
}
string code = new string(chars);
codes.Add(code);
}
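The same idea, together with the odds quoted in this answer, can be checked with a short Python sketch. It uses the standard `secrets` module as the random source; the function names and the small batch size of 1,000 are choices made for the example, not part of the answers above.

```python
import secrets
import string

ALPHABET = string.digits + string.ascii_uppercase  # 0-9 and A-Z, 36 symbols
CODE_LENGTH = 8


def generate_codes(count):
    """Generate `count` distinct random codes; the set discards duplicates."""
    codes = set()
    while len(codes) < count:
        code = "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))
        codes.add(code)
    return codes


def possible_codes_per_valid_code(alphabet_size, length, issued):
    """How many possible codes exist for each issued one (rounded down)."""
    return alphabet_size ** length // issued


batch = generate_codes(1000)
print(len(batch))                                          # 1000
print(possible_codes_per_valid_code(16, 8, 20_000_000))    # 214
print(possible_codes_per_valid_code(36, 8, 20_000_000))    # 141055
```

Even at one valid code in 141,055, an attacker who can submit guesses without restriction will eventually hit one, which is why rate limiting is still needed.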

String matching alternate approach [closed]

I am trying to write my own fast pattern matching algorithm. I don't want to use a language-specific solution; I am focusing on writing the algorithm itself, because I have been reading about different string matching techniques. Some are complicated yet very interesting, like Rabin-Karp.
I came up with this method, which is fast and linear. It works well with the different inputs I have tried. So I was wondering: is there any reason I shouldn't use this approach over the well-known approaches? Basically, I take a character of the text and compare it with the corresponding character of the pattern, one at a time.
Also, if someone could point out my mistake in this one, that would be great. Thank you in advance for your replies and comments :)
public static boolean patternMatch(String pattern, String text)
{
if(pattern == null)
return true;
if(text == null)
return false;
char[] patternArray = pattern.toCharArray();
char[] textArray = text.toCharArray();
int length = pattern.length();
int j = 0;
for(char t : textArray)
{
if(t == patternArray[j])
{
j++;
if(j == length)
return true;
}
else {
j = 0;
if(t == patternArray[j]) j++;
}
}
return false;
}
Two reasons for using a standard approach:
It's easy to write a method that simply does the wrong thing. Your method is like that, because it will fail to match, for instance, the pattern "aab" against the string "aaab". (It matches the first two "a"s of the pattern and the string, fails to match "b" to the third "a" of the string, and then restarts the match from that third "a" alone, forgetting that the second and third "a"s already form the prefix "aa". The final "b" is then compared with the second "a" of the pattern and fails.)
Standard approaches are fast. Your algorithm is linear, which is pretty good (if only it were also correct!). However, many string matching algorithms, such as Boyer-Moore, work in sublinear time on typical inputs. That is, they can skip over parts of the text, so the time it takes to match a string grows more slowly than linearly in the size of the input problem. Perhaps hard to believe, but true. (Read the literature for substantiations of this claim.)
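A quick way to see the first point is to port the method from the question to Python and compare it with a plain, obviously correct (but quadratic) matcher. The port below follows the Java code line by line; the input "aab" against "aaab" is an example chosen here to show the method missing a match that exists.

```python
def pattern_match_question(pattern, text):
    """Line-by-line port of the method in the question (non-empty pattern assumed)."""
    j = 0
    for t in text:
        if t == pattern[j]:
            j += 1
            if j == len(pattern):
                return True
        else:
            # On a mismatch, only the current character is retried against
            # pattern[0]; any longer partial match is thrown away.
            j = 0
            if t == pattern[j]:
                j += 1
    return False


def pattern_match_naive(pattern, text):
    """Try every starting position; slow in the worst case but correct."""
    return any(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))


print(pattern_match_question("aab", "aaab"))  # False, although "aab" occurs
print(pattern_match_naive("aab", "aaab"))     # True
```

Algorithms such as Knuth-Morris-Pratt stay linear while fixing exactly this problem: on a mismatch they fall back to the longest prefix of the pattern that is still matched instead of starting over.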

Parsing script loops in C#

I'm writing an application that will parse a script in a custom language (based slightly on C syntax and Allman style formatting) and am looking for a better (read: faster) way of parsing blocks of the script code into string arrays than the way I'm currently doing it (the current method will do, but it was more for debug than anything else).
The script contents are currently read from a file into a string array and passed to the method.
Here's a script block template:
loop [/* some conditional */ ]
{
/* a whole bunch of commands that are to be read into
* a List<string>, then converted to a string[] and
* passed to the next step for execution */
/* some command that has a bracket delimited set of
* properties or attributes */
{
/* some more commands to be acted on */
}
}
Basically, the curly bracket blocks can be nested (just like in any other C-based language), and I'm looking for the best way to find individual blocks like this.
The curly bracket delimited blocks will ALWAYS be formatted like this - the contents of the brackets will start on the line after the open bracket and will be followed by a bracket on the line after the final attribute/command/comment/whatever.
An example might be:
loop [ someVar <= 10 ]
{
informUser "Get ready to do something"
readValue
{
valueToLookFor = 0x54
timeout = 10 /* in seconds */
}
}
This would tell the app to loop whilst someVar is less than or equal to 10 (sorry for the sucking eggs comment). Each time round, we pass a message to the user and look for a specific value from somewhere (with a timeout of 10 seconds).
Here's how I'm doing it at the minute (note: the method that calls this passes the entire string[] containing the current script into it with an index to read from):
private string[] findEntireBlock(string[] scriptContents, int indexToReadFrom,
out int newIndex)
{
newIndex = 0;
int openBraceCount = 0; // for '{' char count
int closeBraceCount = 0; // for '}' char count
int openSquareCount = 0; // for '[' char count
int closeSquareCount = 0; // for ']' char count
List<string> fullblock = new List<string>();
for (int i = indexToReadFrom; i < scriptContents.Length; i++)
{
if (scriptContents[i].Contains('}'))
{
if (scriptContents[i].Contains("[") && fullblock.Count > 0)
{
//throw new exception, as we shouldn't expect to
//to find a line which starts with [ when we've already
}
else
{
if (scriptContents[i].Contains('{')) openBraceCount++;
if (scriptContents[i].Contains('}')) closeBraceCount++;
if (scriptContents[i].Contains('[')) openSquareCount++;
if (scriptContents[i].Contains(']')) closeSquareCount++;
newIndex = i;
fullblock.Add(scriptContents[i]);
break;
}
}
else
{
if (scriptContents[i].Contains("[") && fullblock.Count > 0)
{
//throw new exception, as we shouldn't expect to
//to find a line which starts with [ when we've already
}
else
{
if (scriptContents[i].Contains('{')) openBraceCount++;
if (scriptContents[i].Contains('}')) closeBraceCount++;
if (scriptContents[i].Contains('[')) openSquareCount++;
if (scriptContents[i].Contains(']')) closeSquareCount++;
fullblock.Add(scriptContents[i]);
}
}
}
if (openBraceCount == closeBraceCount &&
openSquareCount == closeSquareCount)
return fullblock.ToArray();
else
// the number of open brackets doesn't match the number of close brackets
throw new FormatException("Mismatched brackets in script block.");
}
I agree that this might be a slightly obtuse and slow method to follow, that's why I'm asking for any ideas on how to re-implement this for speed and clarity (if a balance can be met, that is).
I'm looking to stay away from RegEx, because I can't use it to maintain a bracket count and I'm not sure on whether you can write a RegEx statement (is that the correct term?) that can act recursively. I was thinking of working from the inside outward, but I'm convinced that would be quite slow.
I'm not looking for someone to re-write it for me, but a general idea on algorithms or techniques/libraries that I could use that would improve my method.
As a side question, how do compilers deal with multiple nested brackets in source code?
Let's Build a Compiler, by Jack Crenshaw, is a fantastic, easy-to-read introduction to building a basic compiler. The techniques discussed should help with what you're trying to do here.
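To give a rough idea of the kind of technique involved: nested brackets are usually handled by tracking the nesting depth, either with an explicit counter or stack or implicitly through recursion, as recursive-descent parsers do. The Python sketch below finds one block with a single pass and a depth counter. It relies on the formatting rule stated in the question, namely that every curly bracket sits alone on its own line; the function name and error messages are invented for the example.

```python
def find_block(lines, start):
    """Return (block_lines, index_of_closing_line) for the block whose
    header line is lines[start]. Assumes '{' and '}' each sit alone on a line."""
    depth = 0
    seen_open = False
    for i in range(start, len(lines)):
        stripped = lines[i].strip()
        if stripped == "{":
            depth += 1
            seen_open = True
        elif stripped == "}":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unexpected '}}' on line {i}")
            if seen_open and depth == 0:
                # Depth is back to zero: this line closes the block.
                return lines[start:i + 1], i
    raise ValueError("block is never closed")


script = [
    "loop [ someVar <= 10 ]",
    "{",
    '    informUser "Get ready to do something"',
    "    readValue",
    "    {",
    "        valueToLookFor = 0x54",
    "        timeout = 10 /* in seconds */",
    "    }",
    "}",
]

outer, outer_end = find_block(script, 0)
inner, inner_end = find_block(script, 3)
print(outer_end, len(outer))  # 8 9
print(inner_end, len(inner))  # 7 5
```

Each line is inspected once, so the cost is linear in the length of the block, and the returned index tells the caller where to continue reading.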

Making a list of integers more human friendly

This is a bit of a side project I have taken on to solve a no-fix issue for work. Our system outputs a code to represent a combination of things on another thing. Some example codes are:
9-9-0-4-4-5-4-0-2-0-0-0-2-0-0-0-0-0-2-1-2-1-2-2-2-4
9-5-0-7-4-3-5-7-4-0-5-1-4-2-1-5-5-4-6-3-7-9-72
9-15-0-9-1-6-2-1-2-0-0-1-6-0-7
The max number in one of the slots I've seen so far is about 150 but they will likely go higher.
When the system was designed there was no requirement for what this code would look like. But now the client wants to be able to type it in by hand from a sheet of paper, something the code above isn't suited for. We've said we won't do anything about it, but it seems like a fun challenge to take on.
My question is where is a good place to start loss-less compressing this code? Obvious solutions such as store this code with a shorter key are not an option; our database is read only. I need to build a two way method to make this code more human friendly.
1) I agree that you definitely need a checksum: data entry errors are very common unless you have really well-trained staff and independent duplicate keying with automatic cross-checking.
2) I suggest http://en.wikipedia.org/wiki/Huffman_coding to turn your list of numbers into a stream of bits. To get the probabilities required for this, you need a decent sized sample of real data, so you can make a count, setting Ni to the number of times number i appears in the data. Then I suggest setting Pi = (Ni + 1) / (Sum_i (Ni + 1)) - which smooths the probabilities a bit. Also, with this method, if you see e.g. numbers 0-150 you could add a bit of slack by entering numbers 151-255 and setting them to Ni = 0. Another way round rare large numbers would be to add some sort of escape sequence.
3) Finding a way for people to type the resulting sequence of bits is really an applied psychology problem but here are some suggestions of ideas to pinch.
3a) Software licences - just encode six bits per character in some 64-character alphabet, but group characters in a way that makes it easier for people to keep place e.g. BC017-06777-14871-160C4
3b) UK car license plates. Use a change of alphabet to show people how to group characters e.g. ABCD0123EFGH4567IJKL...
3c) A really large alphabet - get yourself a list of 2^n words for some decent sized n and encode n bits as a word e.g. GREEN ENCHANTED LOGICIAN...
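Point 2 can be sketched in Python as follows. The sample here is just the three example codes from the question, which is far too small for real use, and the alphabet 0-255 follows the "slack" suggestion above. The weights are N_i + 1; dividing by the common sum to obtain P_i would not change the resulting tree, so that step is skipped. All function names are made up for the example.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_code(sample, alphabet):
    """Huffman code over `alphabet`, weighting symbol i by N_i + 1."""
    counts = Counter(sample)
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(counts[s] + 1, next(tiebreak), s) for s in alphabet]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Internal nodes are (left, right) tuples; leaves are plain ints.
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    code = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            code[node] = prefix or "0"

    walk(heap[0][2], "")
    return code


def encode(numbers, code):
    return "".join(code[n] for n in numbers)


def decode(bits, code):
    reverse = {bit_string: symbol for symbol, bit_string in code.items()}
    numbers, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # prefix-free, so the first hit is the symbol
            numbers.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete symbol")
    return numbers


examples = [
    "9-9-0-4-4-5-4-0-2-0-0-0-2-0-0-0-0-0-2-1-2-1-2-2-2-4",
    "9-5-0-7-4-3-5-7-4-0-5-1-4-2-1-5-5-4-6-3-7-9-72",
    "9-15-0-9-1-6-2-1-2-0-0-1-6-0-7",
]
parsed = [[int(part) for part in text.split("-")] for text in examples]
sample = [n for numbers in parsed for n in numbers]

code = build_huffman_code(sample, range(256))
bits = encode(parsed[0], code)
print(len(bits), "bits for", len(parsed[0]), "numbers")
print(decode(bits, code) == parsed[0])  # True
```

The bit string would still need a checksum and one of the human-friendly encodings from point 3 before anyone types it in.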
I worried about this problem a while back. It turns out that you can't do much better than base64: trying to squeeze a few more bits per character isn't really worth the effort (once you get into "strange" numbers of bits, encoding and decoding become more complex). At the same time, you end up with something that's likely to have errors when entered (confusing a 0 with an O, etc.). One option is to choose a modified set of characters (so it's still base 64 but, say, you substitute ">" for "0"). Another is to add a checksum. Again, for simplicity of implementation, I felt the checksum approach was better.
Unfortunately I never got any further (things changed direction), so I can't offer code or a particular checksum choice.
PS: I realised there's a missing step I didn't explain: I was going to compress the text into some binary form before encoding (using some standard compression algorithm). So to summarize: compress, add checksum, base64 encode; base64 decode, check checksum, decompress.
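A minimal Python sketch of that pipeline might look like this. The answer names no particular compressor or checksum, so zlib and CRC-32 are assumptions picked because they are in the standard library; `pack` and `unpack` are made-up names.

```python
import base64
import zlib


def pack(text):
    """compress -> add checksum -> base64 encode"""
    compressed = zlib.compress(text.encode("ascii"), 9)
    checksum = zlib.crc32(compressed).to_bytes(4, "big")
    return base64.b64encode(compressed + checksum).decode("ascii")


def unpack(token):
    """base64 decode -> check checksum -> decompress"""
    raw = base64.b64decode(token.encode("ascii"), validate=True)
    compressed, checksum = raw[:-4], raw[-4:]
    if zlib.crc32(compressed).to_bytes(4, "big") != checksum:
        raise ValueError("checksum mismatch, probably a typing error")
    return zlib.decompress(compressed).decode("ascii")


original = "9-9-0-4-4-5-4-0-2-0-0-0-2-0-0-0-0-0-2-1-2-1-2-2-2-4"
token = pack(original)
print(token)
print(unpack(token) == original)  # True
```

For inputs as short as these codes, a general-purpose compressor may not shrink anything once its own overhead is counted, so it is worth measuring the token length against the original before adopting this.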
This is similar to what I have used in the past. There are certainly better ways of doing this, but I used this method because it was easy to mirror in Transact-SQL which was a requirement at the time. You could certainly modify this to incorporate Huffman encoding if the distribution of your id's is non-random, but it's probably unnecessary.
You didn't specify a language, so this is in C#, but it should be very easy to port to any language. In the lookup table you'll see that commonly confused characters are omitted, which should speed up entry. I also had a requirement for a fixed length, but it would be easy for you to modify this.
static public class CodeGenerator
{
static Dictionary<int, char> _lookupTable = new Dictionary<int, char>();
static CodeGenerator()
{
PrepLookupTable();
}
private static void PrepLookupTable()
{
_lookupTable.Add(0,'3');
_lookupTable.Add(1,'2');
_lookupTable.Add(2,'5');
_lookupTable.Add(3,'4');
_lookupTable.Add(4,'7');
_lookupTable.Add(5,'6');
_lookupTable.Add(6,'9');
_lookupTable.Add(7,'8');
_lookupTable.Add(8,'W');
_lookupTable.Add(9,'Q');
_lookupTable.Add(10,'E');
_lookupTable.Add(11,'T');
_lookupTable.Add(12,'R');
_lookupTable.Add(13,'Y');
_lookupTable.Add(14,'U');
_lookupTable.Add(15,'A');
_lookupTable.Add(16,'P');
_lookupTable.Add(17,'D');
_lookupTable.Add(18,'S');
_lookupTable.Add(19,'G');
_lookupTable.Add(20,'F');
_lookupTable.Add(21,'J');
_lookupTable.Add(22,'H');
_lookupTable.Add(23,'K');
_lookupTable.Add(24,'L');
_lookupTable.Add(25,'Z');
_lookupTable.Add(26,'X');
_lookupTable.Add(27,'V');
_lookupTable.Add(28,'C');
_lookupTable.Add(29,'N');
_lookupTable.Add(30,'B');
}
public static bool TryPCodeDecrypt(string iPCode, out Int64 oDecryptedInt)
{
//Prep the result so we can exit without having to fiddle with it if we hit an error.
oDecryptedInt = 0;
if (iPCode.Length > 3)
{
Char[] Bits = iPCode.ToCharArray(0,iPCode.Length-2);
int CheckInt7 = 0;
int CheckInt3 = 0;
if (!int.TryParse(iPCode[iPCode.Length-1].ToString(),out CheckInt7) ||
!int.TryParse(iPCode[iPCode.Length-2].ToString(),out CheckInt3))
{
//Unsuccessful -- the last check ints are not integers.
return false;
}
//Adjust the CheckInts to the right values.
CheckInt3 -= 2;
CheckInt7 -= 2;
int COffset = iPCode.LastIndexOf('M')+1;
Int64 tempResult = 0;
int cBPos = 0;
while ((cBPos + COffset) < Bits.Length)
{
//Calculate the current position.
int cNum = 0;
foreach (int cKey in _lookupTable.Keys)
{
if (_lookupTable[cKey] == Bits[cBPos + COffset])
{
cNum = cKey;
}
}
tempResult += cNum * (Int64)Math.Pow((double)31, (double)(Bits.Length - (cBPos + COffset + 1)));
cBPos += 1;
}
if (tempResult % 7 == CheckInt7 && tempResult % 3 == CheckInt3)
{
oDecryptedInt = tempResult;
return true;
}
return false;
}
else
{
//Unsuccessful -- too short.
return false;
}
}
public static string PCodeEncrypt(int iIntToEncrypt, int iMinLength)
{
int Check7 = (iIntToEncrypt % 7) + 2;
int Check3 = (iIntToEncrypt % 3) + 2;
StringBuilder result = new StringBuilder();
result.Insert(0, Check7);
result.Insert(0, Check3);
int workingNum = iIntToEncrypt;
while (workingNum > 0)
{
result.Insert(0, _lookupTable[workingNum % 31]);
workingNum /= 31;
}
if (result.Length < iMinLength)
{
for (int i = result.Length + 1; i <= iMinLength; i++)
{
result.Insert(0, 'M');
}
}
return result.ToString();
}
}
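For readers who find the scheme easier to follow in a shorter form, here is a condensed Python port of the class above: base-31 digits from the same lookup table, two check digits (value mod 3 plus 2, then value mod 7 plus 2), and 'M' as left padding. It is a sketch rather than an exact translation; in particular, this version rejects characters that are not in the table, whereas the C# decoder treats them as zero.

```python
# Index -> character, in the same order as the lookup table above.
ALPHABET = "32547698WQETRYUAPDSGFJHKLZXVCNB"
PAD = "M"
DIGITS = "0123456789"


def encode(n, min_length=0):
    body = ""
    work = n
    while work > 0:
        body = ALPHABET[work % 31] + body
        work //= 31
    # Check digits go last: mod 3 first, then mod 7, each shifted by 2.
    code = body + str(n % 3 + 2) + str(n % 7 + 2)
    return code.rjust(min_length, PAD)


def decode(code):
    """Return the integer, or None if the code is malformed or fails a check."""
    if len(code) <= 3:
        return None
    body, check3, check7 = code[:-2].lstrip(PAD), code[-2], code[-1]
    if check3 not in DIGITS or check7 not in DIGITS:
        return None
    value = 0
    for ch in body:
        index = ALPHABET.find(ch)
        if index < 0:
            return None
        value = value * 31 + index
    if value % 3 + 2 != int(check3) or value % 7 + 2 != int(check7):
        return None
    return value


print(encode(150, 8))      # MMMM7X25
print(decode("MMMM7X25"))  # 150
print(decode("MMMM7V25"))  # None: the mistyped character fails the mod 3 check
```

Because the two checks together only test the value modulo 21, some typing errors will still slip through; a longer checksum would catch more of them.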
