Parsing script loops in C# - algorithm

I'm writing an application that will parse a script in a custom language (based slightly on C syntax and Allman style formatting) and am looking for a better (read: faster) way of parsing blocks of the script code into string arrays than the way I'm currently doing it (the current method will do, but it was more for debug than anything else).
The script contents are currently read from a file into a string array and passed to the method.
Here's a script block template:
loop [/* some conditional */ ]
{
/* a whole bunch of commands that are to be read into
* a List<string>, then converted to a string[] and
* passed to the next step for execution */
/* some command that has a bracket delimited set of
* properties or attributes */
{
/* some more commands to be acted on */
}
}
Basically, the curly bracket blocks can be nested (just like in any other C-based language), and I'm looking for the best way to find individual blocks like this.
The curly bracket delimited blocks will ALWAYS be formatted like this - the contents of the brackets will start on the line after the open bracket and will be followed by a bracket on the line after the final attribute/command/comment/whatever.
An example might be:
loop [ someVar <= 10 ]
{
informUser "Get ready to do something"
readValue
{
valueToLookFor = 0x54
timeout = 10 /* in seconds */
}
}
This would tell the app to loop whilst someVar is less than 10 (sorry for the sucking eggs comment). Each time round, we pass a message to the user and look for a specific value from somewhere (with a timeout of 10 seconds).
Here's how I'm doing it at the minute (note: the method that calls this passes the entire string[] containing the current script into it with an index to read from):
private string[] findEntireBlock(string[] scriptContents, int indexToReadFrom,
out int newIndex)
{
newIndex = 0;
int openBraceCount = 0; // for '{' char count
int closeBraceCount = 0; // for '}' char count
int openSquareCount = 0; // for '[' char count
int closeSquareCount = 0; // for ']' char count
List<string> fullblock = new List<string>();
for (int i = indexToReadFrom; i < scriptContents.Length; i++)
{
if (scriptContents[i].Contains('}'))
{
if (scriptContents[i].Contains("[") && fullblock.Count > 0)
{
//throw new exception, as we shouldn't expect to
//to find a line which starts with [ when we've already
}
else
{
if (scriptContents[i].Contains('{')) openBraceCount++;
if (scriptContents[i].Contains('}')) closeBraceCount++;
if (scriptContents[i].Contains('[')) openSquareCount++;
if (scriptContents[i].Contains(']')) closeBraceCount++;
newIndex = i;
fullblock.Add(scriptContents[i]);
break;
}
}
else
{
if (scriptContents[i].Contains("[") && fullblock.Count > 0)
{
//throw new exception, as we shouldn't expect to
//to find a line which starts with [ when we've already
}
else
{
if (scriptContents[i].Contains('{')) openBraceCount++;
if (scriptContents[i].Contains('}')) closeBraceCount++;
if (scriptContents[i].Contains('[')) openSquareCount++;
if (scriptContents[i].Contains(']')) closeBraceCount++;
fullblock.Add(scriptContents[i]);
}
}
}
if (openBraceCount == closeBraceCount &&
openSquareCount == closeSquareCount)
return fullblock.ToArray();
else
//throw new exception, the number of open brackets doesn't match
//the number of close brackets
}
I agree that this might be a slightly obtuse and slow method to follow, that's why I'm asking for any ideas on how to re-implement this for speed and clarity (if a balance can be met, that is).
I'm looking to stay away from RegEx, because I can't use it to maintain a bracket count and I'm not sure on whether you can write a RegEx statement (is that the correct term?) that can act recursively. I was thinking of working from the inside outward, but I'm convinced that would be quite slow.
I'm not looking for someone to re-write it for me, but a general idea on algorithms or techniques/libraries that I could use that would improve my method.
As a side question, how do compilers deal with multiple nested brackets in source code?

Let's Build a Compiler, by Jack Crenshaw, is a fantastic, easy-to-read introduction to building a basic compiler. The techniques discussed should help with what you're trying to do here.

Related

What is the most efficient way to replace a list of words without touching html attributes?

I absolutely disagree that this question is a duplicate! I am asking for an efficiency way to replace hundreds of words at once. This is an algorithm question! All the provided links are about to replace one word. Should I repeat that expensive operation hundreds of times? I'm sure that there are better ways as a suffix tree where I sort out html while building that tree. I removed that regex tag since for no good reason you are focusing on that part.
I want to translate a given set of words (more then 100) and translate them. My first idea was to use a simple regular expression that works better then expected. As sample:
const input = "I like strawberry cheese cake with apple juice"
const en2de = {
apple: "Apfel",
strawberry: "Erdbeere",
banana: "Banane",
/* ... */}
input.replace(new RegExp(Object.keys(en2de).join("|"), "gi"), match => en2de[match.toLowerCase()])
This works fine on the first view. However it become strange if you words which contains each other like "pineapple" that would return "pineApfel" which is totally nonsense. So I was thinking about checking word boundaries and similar things. While playing around I created this test case:
Apple is a company
That created the output:
Apfel is a company.
The translation is wrong, which is somehow tolerable, but the link is broken. That must not happen.
So I was thinking about extend the regex to check if there is a quote before. I know well that html parsing with regex is a bad idea, but I thought that this should work anyway. In the end I gave up and was looking for solutions of other devs and found on Stack Overflow a couple of questions, all without answers, so it seems to be a hard problem (or my search skills are bad).
So I went two steps back and was thinking to implement that myself with a parser or something like that. However since I have multiple inputs and I need to ignore the case I was thinking what the best way is.
Right now I think to build a dictionary with pointers to the position of the words. I would store the dict in lower case so that should be fast, I could also skip all words with the wrong prefix etc to get my matches. In the end I would replace the words from the end to the beginning to avoid breaking the indices. But is that way efficiency? Is there a better way to achieve that?
While my sample is in JavaScript the solution must not be in JS as long the solution doesn't include dozens of dependencies which cannot be translated easy to JS.
TL;DR:
I want to replace multiple words by other words in a case insensitive way without breaking html.
You may try a treeWalker and replace the text inplace.
To find words you may tokenize your text, lower case your words and map them.
const mapText = (dic, s) => {
return s.replace(/[a-zA-Z-_]+/g, w => {
return dic[w.toLowerCase()] || w
})
}
const dic = {
'grodzi': 'grodzila',
'laaaa': 'forever',
}
const treeWalker = document.createTreeWalker(
document.body,
NodeFilter.SHOW_TEXT
)
// skip body node
let currentNode = treeWalker.nextNode()
while(currentNode) {
const newS = mapText(dic, currentNode.data)
currentNode.data = newS
currentNode = treeWalker.nextNode()
}
p {background:#eeeeee;}
<p>
grodzi
LAAAA
</p>
The link stay untouched.
However mapping each word in an other language is bound to fail (be it missing representation of some word, humour/irony, or simply grammar construct). For this matter (which is a hard problem on its own) you may rely on some tools to translate data for you (neural networks, api(s), ...)
Here is my current work in progress solution of a suffix tree (or at least how I interpreted it). I'm building a dictionary with all words, which are not inside of a tag, with their position. After sorting the dict I replace them all. This works for me without handling html at all.
function suffixTree(input) {
const dict = new Map()
let start = 0
let insideTag = false
// define word borders
const borders = ' .,<"\'(){}\r\n\t'.split('')
// build dictionary
for (let end = 0; end <= input.length; end++) {
const c = input[end]
if (c === '<') insideTag = true
if (c === '>') {
insideTag = false
start = end + 1
continue
}
if (insideTag && c !== '<') continue
if (borders.indexOf(c) >= 0) {
if(start !== end) {
const word = input.substring(start, end).toLowerCase()
const entry = dict.get(word) || []
// save the word and its position in an array so when the word is
// used multiple times that we can use this list
entry.push([start, end])
dict.set(word, entry)
}
start = end + 1
}
}
// last word handling
const word = input.substring(start).toLowerCase()
const entry = dict.get(word) || []
entry.push([start, input.length])
dict.set(word, entry)
// create a list of replace operations, we would break the
// indices if we do that directly
const words = Object.keys(en2de)
const replacements = []
words.forEach(word => {
(dict.get(word) || []).forEach(match => {
// [0] is start, [1] is end, [2] is the replacement
replacements.push([match[0], match[1], en2de[word]])
})
})
// converting the input to a char array and replacing the found ranges.
// beginning from the end and replace the ranges with the replacement
let output = [...input]
replacements.sort((a, b) => b[0] - a[0])
replacements.forEach(match => {
output.splice(match[0], match[1] - match[0], match[2])
})
return output.join('')
}
Feel free to leave a comment how this can be improved.

Converting an if code into forloop statement

Right now i have to write a code that will print out "*" for each collectedDots, however if it doesn't collect any collectedDots==0 then it print out "". Using too many if statements look messy and i was wandering how you would implement the forloop in this case.
As a general principle the kind of rearrangement you've done here is good. You have found a way to express the rule in a general way rather than as a sequence of special cases. This is much easier to reason about and to check, and it's obviously extensible to cases where you have more than 3 dots.
You probably have made an error in confusing your target number and the iteration value, I assume that collectedDots contains the number of dots you have (as per your if statement) and so you need to introduce a variable to count up to that value
for (int i =0; i <= collectedDots; i++)
{
stars = "*";
System.out.print(stars);
}
Ok, so you already have a variable called collectedDots that is a number which tells you how many stars to print?
So your loop would be something like
for every collected dot
print *
But you can't just print it out, you need to return a string that will be printed out. So it's more like
for every collected dot
add a * to our string
return the string
They key difference between this and your attempt so far is that you were assigning a star to be your string each time through the loop, then at the end of it, you return that string–no matter how many times you assign a star to the string, the string will always just be one star.
You also need a separate variable to keep track of your loop, this should do the trick:
String stars = "";
for(int i = 0; i < collectedDots; i++)
{
stars = stars + "*";
}
return stars;
You are almost correct. Just need to change range limit of looping. Looping initial value is set to 1. So whenever you have collectedDots = 0, it will not go in loop and will return "", as stars is intialized with "" before loop.
String stars = "";
for (int i =1; i <= collectedDots; i++)
{
stars = "*";
System.out.print(stars);
}
return stars;

Two semicolons inside a for-loop parentheses

I'm customising a code I found over the internet (it's the Adafruit Tweet Receipt). I cannot understand many parts of the code but the most perplexing to me is the for-loop with two semicolons inside the parentheses
boolean jsonParse(int depth, byte endChar) {
int c, i;
boolean readName = true;
for(;;) { //<---------
while(isspace(c = timedRead())); // Scan past whitespace
if(c < 0) return false; // Timeout
if(c == endChar) return true; // EOD
if(c == '{') { // Object follows
if(!jsonParse(depth + 1, '}')) return false;
if(!depth) return true; // End of file
if(depth == resultsDepth) { // End of object in results list
What does for(;;) mean? (It's an Arduino program so I guess it's in C).
for(;;) {
}
functionally means
while (true) {
}
It will probably break the loop/ return from loop based on some condition inside the loop body.
The reason that for(;;) loops forever is because for has three parts, each of which is optional. The first part initializes the loop; the second decides whether or not to continue the loop, and the third does something at the end of each iteration. It is full form, you would typically see something like this:
for(i = 0; i < 10; i++)
If the first (initialization) or last (end-of-iteration) parts are missing, nothing is done in their place. If the middle (test) part is missing, then it acts as though true were there in its place. So for(;;) is the same as for(;true;)', which (as shown above) is the same as while (true).
The for loop has 3 components, separated by semi-colons. The first component runs before the looping starts and is commonly used to initialize a variable. The second is a condition. The condition is checked at the beginning of each iteration, and if it evaluates to true, then the code in the loop runs. The third components is executed at the end of the loop, before another iteration (starting with condition check) begins, and is often used to increment a variable.
In your case for(;;) means that it will loop forever since the condition is not present. The loop ends when the code returns or breaks.
Each clause of a for loop is optional. So when they are excluded, it still loops. for loops compile into while loops.
The end result becomes a check to initialize any variables, which concludes after nothing happening since it is empty, a check to the boolean condition in the second clause, which is not present so the loop starts, and once the loop hits the end bracket, a check to see if there is any code to run before checking the boolean condition again.
In code it looks like:
while(true){
}
Here's What Wikipedia Says About it
Use as infinite loops
This C-style for-loop is commonly the source of an infinite loop since the fundamental steps of iteration are completely in the control of the programmer. In fact, when infinite loops are intended, this type of for-loop can be used (with empty expressions), such as:
for (;;)
//loop body
This style is used instead of infinite while(1) loops to avoid a type conversion warning in some C/C++ compilers.Some programmers prefer the more succinct for(;;) form over the semantically equivalent but more verbose while (true) form.

How to convert Chinese characters to Pinyin [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed last year.
The community reviewed whether to reopen this question 4 months ago and left it closed:
Original close reason(s) were not resolved
Improve this question
For sorting Chinese language text, I want to convert Chinese characters to Pinyin, properly separating each Chinese character and grouping successive characters together.
Can you please help me in this task by providing the logic or source code for doing this?
Please let me know if any open source or lib already present for this.
Short answer: you don't.
Long answer: There is no one-to-one mapping for 汉字 to 汉语拼音. Just some quick examples:
把 can be "ba" in the third tone or fourth tone.
了 can be "le" toneless or "liao" third tone.
乐 can be "le" or "yue", both in the fourth tone.
落 can be "luo", "la" or "lao", all in the fourth tone.
And so on. I have a beginners' book on this topic that has 207 examples. I stress that this is a beginners' book and is by no means complete. Each one has a page or two of examples of use and conditions under which you choose the appropriate pronunciation. It is not something that could be easily programmed (if at all).
And this doesn't even address the other slippery thing you want to deal with: the separation of characters into grouped words. The very notion of a word is a bit slippery in Chinese. (There's two terms that correspond, roughly to "word" in Chinese for example: 字 and 词. The first is the character, the second groups of characters that are put together into one concept. (I frequently get asked by Chinese speakers how many "words" I can read when they really mean "characters".) While in some cases the distinction is clear (the 词 "乌鸦", for example, is "crow" -- the two 字 must be together to express the idea properly and it would be incorrect to translate it as "black crow"), in others it is not so clear. What does "你好" translate to? Is it one word meaning, idiomatically, "hello"? Or is it two words translating literally to "you good"? Each of the characters involved stands alone or in groups with other words, but together they mean something entirely different from their individual meanings. Given this, how, precisely, do you plan to group the 汉语拼音 transliterations (which are difficult to impossible to get right in the first place!) into "words"?
While #JUST MY correct OPINION's answer addresses some of the difficulties of converting characters into pinyin, it is not an impossible problem to solve.
I have written a library (pinyinify) that solves this task with decent accuracy. Even though there is not a one-to-one mapping between characters and pinyin, my library can usually decide which pronunciation is correct. For example, "我受不了了" correctly converts to "wǒ shòubùliǎo le", with two different pronunciations of 了.
My approach to solving the problem is pretty simple:
First segment the text into words. For example, 我喜欢旅游 would be divided into three words: 我 喜欢 旅游. This is also not a simple process, but there are many libraries for it. jieba is one of the more popular libraries for this purpose.
Use a dictionary to convert the words into pinyin.
If the word is not in the dictionary, fall back to converting the individual characters to pinyin using their most common pronunciation.
CoreFoundation provides certain method to do the conversion:
CFMutableStringRef string = CFStringCreateMutableCopy(NULL, 0, CFSTR("中文"));
CFStringTransform(string, NULL, kCFStringTransformMandarinLatin, NO);
CFStringTransform(string, NULL, kCFStringTransformStripDiacritics, NO);
NSLog(#"%#", string);
The output is
zhong wen
the following code writing in C# can help you to simply convert chinese words that including in gb2312 encodec(just 2312 of often used Simplified-Chinese words) to pinyin.like convert "今天天气不错" to "JinTianTianQiBuCuo".
sometimes a chinese word is not one to one map to a pinyin,it depends on the context we talk about.like the "行" in "自行车"(bike) is pronounced "Xing",but in "银行"(bank) it pronounced "Hang".so if you have problem with this,you may find more complex solution to handle this.
sorry for my poor english.i hope this could give you a little help.
public class ChineseToPinYin
{
private static int[] pyValue = new int[]
{
-20319,-20317,-20304,-20295,-20292,-20283,-20265,-20257,-20242,-20230,-20051,-20036,
-20032,-20026,-20002,-19990,-19986,-19982,-19976,-19805,-19784,-19775,-19774,-19763,
-19756,-19751,-19746,-19741,-19739,-19728,-19725,-19715,-19540,-19531,-19525,-19515,
-19500,-19484,-19479,-19467,-19289,-19288,-19281,-19275,-19270,-19263,-19261,-19249,
-19243,-19242,-19238,-19235,-19227,-19224,-19218,-19212,-19038,-19023,-19018,-19006,
-19003,-18996,-18977,-18961,-18952,-18783,-18774,-18773,-18763,-18756,-18741,-18735,
-18731,-18722,-18710,-18697,-18696,-18526,-18518,-18501,-18490,-18478,-18463,-18448,
-18447,-18446,-18239,-18237,-18231,-18220,-18211,-18201,-18184,-18183, -18181,-18012,
-17997,-17988,-17970,-17964,-17961,-17950,-17947,-17931,-17928,-17922,-17759,-17752,
-17733,-17730,-17721,-17703,-17701,-17697,-17692,-17683,-17676,-17496,-17487,-17482,
-17468,-17454,-17433,-17427,-17417,-17202,-17185,-16983,-16970,-16942,-16915,-16733,
-16708,-16706,-16689,-16664,-16657,-16647,-16474,-16470,-16465,-16459,-16452,-16448,
-16433,-16429,-16427,-16423,-16419,-16412,-16407,-16403,-16401,-16393,-16220,-16216,
-16212,-16205,-16202,-16187,-16180,-16171,-16169,-16158,-16155,-15959,-15958,-15944,
-15933,-15920,-15915,-15903,-15889,-15878,-15707,-15701,-15681,-15667,-15661,-15659,
-15652,-15640,-15631,-15625,-15454,-15448,-15436,-15435,-15419,-15416,-15408,-15394,
-15385,-15377,-15375,-15369,-15363,-15362,-15183,-15180,-15165,-15158,-15153,-15150,
-15149,-15144,-15143,-15141,-15140,-15139,-15128,-15121,-15119,-15117,-15110,-15109,
-14941,-14937,-14933,-14930,-14929,-14928,-14926,-14922,-14921,-14914,-14908,-14902,
-14894,-14889,-14882,-14873,-14871,-14857,-14678,-14674,-14670,-14668,-14663,-14654,
-14645,-14630,-14594,-14429,-14407,-14399,-14384,-14379,-14368,-14355,-14353,-14345,
-14170,-14159,-14151,-14149,-14145,-14140,-14137,-14135,-14125,-14123,-14122,-14112,
-14109,-14099,-14097,-14094,-14092,-14090,-14087,-14083,-13917,-13914,-13910,-13907,
-13906,-13905,-13896,-13894,-13878,-13870,-13859,-13847,-13831,-13658,-13611,-13601,
-13406,-13404,-13400,-13398,-13395,-13391,-13387,-13383,-13367,-13359,-13356,-13343,
-13340,-13329,-13326,-13318,-13147,-13138,-13120,-13107,-13096,-13095,-13091,-13076,
-13068,-13063,-13060,-12888,-12875,-12871,-12860,-12858,-12852,-12849,-12838,-12831,
-12829,-12812,-12802,-12607,-12597,-12594,-12585,-12556,-12359,-12346,-12320,-12300,
-12120,-12099,-12089,-12074,-12067,-12058,-12039,-11867,-11861,-11847,-11831,-11798,
-11781,-11604,-11589,-11536,-11358,-11340,-11339,-11324,-11303,-11097,-11077,-11067,
-11055,-11052,-11045,-11041,-11038,-11024,-11020,-11019,-11018,-11014,-10838,-10832,
-10815,-10800,-10790,-10780,-10764,-10587,-10544,-10533,-10519,-10331,-10329,-10328,
-10322,-10315,-10309,-10307,-10296,-10281,-10274,-10270,-10262,-10260,-10256,-10254
};
private static string[] pyName = new string[]
{
"A","Ai","An","Ang","Ao","Ba","Bai","Ban","Bang","Bao","Bei","Ben",
"Beng","Bi","Bian","Biao","Bie","Bin","Bing","Bo","Bu","Ba","Cai","Can",
"Cang","Cao","Ce","Ceng","Cha","Chai","Chan","Chang","Chao","Che","Chen","Cheng",
"Chi","Chong","Chou","Chu","Chuai","Chuan","Chuang","Chui","Chun","Chuo","Ci","Cong",
"Cou","Cu","Cuan","Cui","Cun","Cuo","Da","Dai","Dan","Dang","Dao","De",
"Deng","Di","Dian","Diao","Die","Ding","Diu","Dong","Dou","Du","Duan","Dui",
"Dun","Duo","E","En","Er","Fa","Fan","Fang","Fei","Fen","Feng","Fo",
"Fou","Fu","Ga","Gai","Gan","Gang","Gao","Ge","Gei","Gen","Geng","Gong",
"Gou","Gu","Gua","Guai","Guan","Guang","Gui","Gun","Guo","Ha","Hai","Han",
"Hang","Hao","He","Hei","Hen","Heng","Hong","Hou","Hu","Hua","Huai","Huan",
"Huang","Hui","Hun","Huo","Ji","Jia","Jian","Jiang","Jiao","Jie","Jin","Jing",
"Jiong","Jiu","Ju","Juan","Jue","Jun","Ka","Kai","Kan","Kang","Kao","Ke",
"Ken","Keng","Kong","Kou","Ku","Kua","Kuai","Kuan","Kuang","Kui","Kun","Kuo",
"La","Lai","Lan","Lang","Lao","Le","Lei","Leng","Li","Lia","Lian","Liang",
"Liao","Lie","Lin","Ling","Liu","Long","Lou","Lu","Lv","Luan","Lue","Lun",
"Luo","Ma","Mai","Man","Mang","Mao","Me","Mei","Men","Meng","Mi","Mian",
"Miao","Mie","Min","Ming","Miu","Mo","Mou","Mu","Na","Nai","Nan","Nang",
"Nao","Ne","Nei","Nen","Neng","Ni","Nian","Niang","Niao","Nie","Nin","Ning",
"Niu","Nong","Nu","Nv","Nuan","Nue","Nuo","O","Ou","Pa","Pai","Pan",
"Pang","Pao","Pei","Pen","Peng","Pi","Pian","Piao","Pie","Pin","Ping","Po",
"Pu","Qi","Qia","Qian","Qiang","Qiao","Qie","Qin","Qing","Qiong","Qiu","Qu",
"Quan","Que","Qun","Ran","Rang","Rao","Re","Ren","Reng","Ri","Rong","Rou",
"Ru","Ruan","Rui","Run","Ruo","Sa","Sai","San","Sang","Sao","Se","Sen",
"Seng","Sha","Shai","Shan","Shang","Shao","She","Shen","Sheng","Shi","Shou","Shu",
"Shua","Shuai","Shuan","Shuang","Shui","Shun","Shuo","Si","Song","Sou","Su","Suan",
"Sui","Sun","Suo","Ta","Tai","Tan","Tang","Tao","Te","Teng","Ti","Tian",
"Tiao","Tie","Ting","Tong","Tou","Tu","Tuan","Tui","Tun","Tuo","Wa","Wai",
"Wan","Wang","Wei","Wen","Weng","Wo","Wu","Xi","Xia","Xian","Xiang","Xiao",
"Xie","Xin","Xing","Xiong","Xiu","Xu","Xuan","Xue","Xun","Ya","Yan","Yang",
"Yao","Ye","Yi","Yin","Ying","Yo","Yong","You","Yu","Yuan","Yue","Yun",
"Za", "Zai","Zan","Zang","Zao","Ze","Zei","Zen","Zeng","Zha","Zhai","Zhan",
"Zhang","Zhao","Zhe","Zhen","Zheng","Zhi","Zhong","Zhou","Zhu","Zhua","Zhuai","Zhuan",
"Zhuang","Zhui","Zhun","Zhuo","Zi","Zong","Zou","Zu","Zuan","Zui","Zun","Zuo"
};
/// <summary>
/// 把汉字转换成拼音(全拼)
/// </summary>
/// <param name="hzString">汉字字符串</param>
/// <returns>转换后的拼音(全拼)字符串</returns>
public static string Convert(string hzString)
{
// 匹配中文字符
Regex regex = new Regex("^[\u4e00-\u9fa5]$");
byte[] array = new byte[2];
string pyString = "";
int chrAsc = 0;
int i1 = 0;
int i2 = 0;
char[] noWChar = hzString.ToCharArray();
for (int j = 0; j < noWChar.Length; j++)
{
// 中文字符
if (regex.IsMatch(noWChar[j].ToString()))
{
array = System.Text.Encoding.Default.GetBytes(noWChar[j].ToString());
i1 = (short)(array[0]);
i2 = (short)(array[1]);
chrAsc = i1 * 256 + i2 - 65536;
if (chrAsc > 0 && chrAsc < 160)
{
pyString += noWChar[j];
}
else
{
// 修正部分文字
if (chrAsc == -9254) // 修正“圳”字
pyString += "Zhen";
else
{
for (int i = (pyValue.Length - 1); i >= 0; i--)
{
if (pyValue[i] <= chrAsc)
{
pyString += pyName[i];
break;
}
}
}
}
}
// 非中文字符
else
{
pyString += noWChar[j].ToString();
}
}
return pyString;
}
}
You can use the following method:
from __future__ import unicode_literals
from pypinyin import lazy_pinyin
hanzi_list = ['如何', '将', '汉字','转为', '拼音']
pinyin_list = [''.join(lazy_pinyin(_)) for _ in hanzi_list]
Output:
['ruhe', 'jiang', 'hanzi', 'zhuanwei', 'pinyin']
i had this problem and i found a solution in PHP (which could be cleaner i suppose but it works). I had some troubles because the file given in this topic is from hexa unicode.
1) Import the data from ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/data/Uni2Pinyin.gz (thanks pierr) to your database or whatever
2) Import your data in an array as $pinyinArray[$hexaUnicode] = $pinyin;
3) Use this code:
/*
* Decimal representation of $c
* function found there: http://www.cantonese.sheik.co.uk/phorum/read.php?2,19594
*/
function uniord($c)
{
$ud = 0;
if (ord($c{0})>=0 && ord($c{0})<=127)
$ud = $c{0};
if (ord($c{0})>=192 && ord($c{0})<=223)
$ud = (ord($c{0})-192)*64 + (ord($c{1})-128);
if (ord($c{0})>=224 && ord($c{0})<=239)
$ud = (ord($c{0})-224)*4096 + (ord($c{1})-128)*64 + (ord($c{2})-128);
if (ord($c{0})>=240 && ord($c{0})<=247)
$ud = (ord($c{0})-240)*262144 + (ord($c{1})-128)*4096 + (ord($c{2})-128)*64 + (ord($c{3})-128);
if (ord($c{0})>=248 && ord($c{0})<=251)
$ud = (ord($c{0})-248)*16777216 + (ord($c{1})-128)*262144 + (ord($c{2})-128)*4096 + (ord($c{3})-128)*64 + (ord($c{4})-128);
if (ord($c{0})>=252 && ord($c{0})<=253)
$ud = (ord($c{0})-252)*1073741824 + (ord($c{1})-128)*16777216 + (ord($c{2})-128)*262144 + (ord($c{3})-128)*4096 + (ord($c{4})-128)*64 + (ord($c{5})-128);
if (ord($c{0})>=254 && ord($c{0})<=255) //error
$ud = false;
return $ud;
}
/*
* Translate the $string string of a single chinese charactere to unicode
*/
function chineseToHexaUnicode($string) {
return strtoupper(dechex(uniord($string)));
}
/*
*
*/
function convertChineseToPinyin($string,$pinyinArray) {
$pinyinValue = '';
for ($i = 0; $i < mb_strlen($string);$i++)
$pinyinValue.=$pinyinArray[chineseToHexaUnicode(mb_substr($string, $i, 1))];
return $pinyinValue;
}
$string = '龙江省五大';
echo convertChineseToPinyin($string,$pinyinArray);
echo: (long2)(jiang1)(sheng3,xing3)(wu3)(da4,dai4)
Of course, $pinyinArray is your array of data (hexoUnicode => pinyin)
Hope it will help someone.
If you use Visual Studio, this might be an option:
Microsoft.International.Converters.PinYinConverter
How to install:
First, download the Visual Studio International Pack 2.0, Official Download. Once the download is complete install the run file VSIPSetup.msi installation (x86 operating system on the default installation directory (C:\Program Files\Microsoft Visual Studio International Feature Pack 2.0).
After installation, you need to add a reference in VS, respectively reference:
C:\Program Files\Microsoft Visual Studio International Pack\Simplified Chinese Pin-Yin Conversion Library (Pinyin)
and
C:\Program Files\Microsoft Visual Studio International Pack\Traditional Chinese to Simplified Chinese Conversion Library and Add-In Tool (Traditional and Simplified Huzhuan to)
How to use:
public static string GetPinyin(string str)
{
string r = string.Empty;
foreach (char obj in str)
{
try
{
ChineseChar chineseChar = new ChineseChar(obj);
string t = chineseChar.Pinyins[0].ToString();
r += t.Substring(0, t.Length - 1);
}
catch
{
r += obj.ToString();
}
}
return r;
}
Source:
http://www.programering.com/a/MzM3cTMwATA.html

nice way of doing 'if ALL items in collection pass test'

I mean this:
bool passed = true;
for(int i = 0; i < collection.Length; i++)
{
if(!PassesTest(collection[i]))
{
passed = false;
break;
}
}
if(passed){/*passed code*/}
requires extra variable, extra test
for(int i = 0; i < collection.Length; i++)
{
if(!PassesTest(collection[i]))
{
return;
}
}
{/*passed code*/}
neat, but requires this to be it's own function, if this it's self is inside a loop or something, not the most performant way of doing things. also, writing a whole extra function is a pain
if(passed){/*passed code*/}
for(int i = 0; i < collection.Length; i++)
{
if(!PassesTest(collection[i]))
{
goto failed;
}
}
{/*passed code*/}
failed: {}
great, but you have to screw around with label names and ugly label syntax
for(int i = 0; ; i++)
{
if(!(i < collection.Length))
{
{/*passed code*/}
break;
}
if(!PassesTest(collection[i]))
{
break;
}
}
probably the nicest, but still a bit manual, kinda wasting the functionality of the for loop construct, for instance, you can't do this with a foreach
what is the nicest way to handle this problem?
it seems to me something like this would be nice:
foreach(...)
{
...
}
finally{...} // only executed if loop ends conventionally (without break)
am I missing something? because this is a very common problem for me, and I don't really like any of the solutions I've come up with.
I use c++ and C#, so solutions in either would be great.
but would also be interested in solutions in other languages. (though a design principle that avoids this in any language would be ideal)
If your language doesn't have this feature, write a function "forall," which takes two arguments: a list and a boolean-valued function which is to be true for all elements of the list. Then you only have to write it once, and it matters very little how idiomatic it is.
The "forall" function looks exactly like your second code sample, except that now "collection" and "passesTest" are the arguments to that function.
Calling forall looks roughly like:
if (forall(myList,isGood)) {
which is readable.
As an added bonus, you could implement "exists" by calling "forall" on the negated boolean function, and negating its answer. That is, "exists x P(x)" is implemented as "not forall x not P(x)".
You can use Linq in .NET.
Here's an example in C#:
if(collection.All(item => PassesTest(item)))
{
// do my code
}
C++ and STL
if (std::all_of(collection.begin(), collection.end(), PassesTest))
{
/* passed code */
}
Ruby:
if collection.all? {|n| n.passes_test?}
# do something
end
Clojure:
(if (every? passes-test? collection)
; do something
)
Groovy:
if (!collection.find { !PassesTest(it) }) {
// do something
}
Scala:
def passesTest(i: Int) = i < 5 // example, would likely be idiomatically in-line
var seq = List(1,2,3,4);
seq.forall(passesTest) // => True
Most, if not all of the answers presented here are saying: higher-order constructs -- such as "passing functions" -- are really, really nice. If you are designing a language, don't forget about the last 60+ years of programming languages/designs.
Python (since v2.5):
if all(PassesTest(c) for c in collection):
do something
Notes:
we can iterate directly on collection, no need for an index and lookup
all() and any() builtin functions were added in Python 2.5
the argument to any(), i.e. PassesTest(c) for c in collection, is a generator expression. You could also view it as a list comprehension by adding brackets [(PassesTest(c) for c in collection]
... for c in collection causes iteration through a collection (or sequence/tuple/list). For a dict, use dict.keys(). For other datatypes, use the correct iterator method.

Resources