CoreNLP Sentiment training data in wrong format - stanford-nlp

I'm trying to train my own sentiment analysis model for CoreNLP. I want to do this in Java code (not from the command line), so I copied pieces from https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/sentiment/BuildBinarizedDataset.java to prepare the data, and then copied some pieces from https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/sentiment/SentimentTraining.java to do the actual training. To understand what's going on, I condensed the code from the former link (lines 171-226) a bit in my own code, into the following:
String text = IOUtils.slurpFileNoExceptions(inputPath);
String[] chunks = text.split("\\n\\s*\\n+"); // need blank line to make a new chunk
for (String chunk : chunks) {
    if (chunk.trim().isEmpty()) {
        continue;
    }
    String[] lines = chunk.trim().split("\\n");
    String sentence = lines[0];
    StringReader sin = new StringReader(sentence);
    DocumentPreprocessor document = new DocumentPreprocessor(sin);
    document.setSentenceFinalPuncWords(new String[] { "\n" });
    List<HasWord> tokens = document.iterator().next();
    // the first token of each chunk is the gold label for the whole sentence
    Integer mainLabel = Integer.valueOf(tokens.get(0).word());
    tokens = tokens.subList(1, tokens.size());
    Map<Pair<Integer, Integer>, String> spanToLabels = Generics.newHashMap();
    for (int i = 1; i < lines.length; ++i) {
        extractLabels(spanToLabels, tokens, lines[i]);
    }
    Tree tree = parser.apply(tokens);
    Tree binarized = binarizer.transformTree(tree);
    Tree collapsedUnary = transformer.transformTree(binarized);
    if (sentimentModel != null) {
        Trees.convertToCoreLabels(collapsedUnary);
        SentimentCostAndGradient scorer = new SentimentCostAndGradient(sentimentModel, null);
        scorer.forwardPropagateTree(collapsedUnary);
        setPredictedLabels(collapsedUnary);
    } else {
        setUnknownLabels(collapsedUnary, mainLabel);
    }
    Trees.convertToCoreLabels(collapsedUnary);
    collapsedUnary.indexSpans();
    for (Map.Entry<Pair<Integer, Integer>, String> pairStringEntry : spanToLabels.entrySet()) {
        setSpanLabel(collapsedUnary, pairStringEntry.getKey(), pairStringEntry.getValue());
    }
    //trainingTrees.add(collapsedUnary);
    System.out.println("Debugging collapsed Unary:" + collapsedUnary);
}
The println gives me something like:
> Debugging collapsed Unary:(ROOT (NP (DT The) (NNS performances)) (@S (VP (VBP are) (ADJP (RB uniformly) (JJ good))) (. .)))
Whereas, from what I understand, it is supposed to look like this (as far as the format goes; sorry for copying a different sentence here):
(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 ...
As explained in https://mailman.stanford.edu/pipermail/java-nlp-user/2013-November/004308.html, "stanford corenlp sentiment training set", "How to train the Stanford NLP Sentiment Analysis tool", etc.
Nothing happens after these lines in BuildBinarizedDataset. Can someone tell me how to get it into the right format? (hacking something together myself feels quite stupid here, and there must be something I'm missing.)
That is, the error I get later on, in SentimentTraining, is:
Exception in thread "main" java.lang.NumberFormatException: For input string: "DT"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.valueOf(Integer.java:766)
at edu.stanford.nlp.sentiment.SentimentUtils.attachLabels(SentimentUtils.java:37)
at edu.stanford.nlp.sentiment.SentimentUtils.attachLabels(SentimentUtils.java:33)
at edu.stanford.nlp.sentiment.SentimentUtils.attachLabels(SentimentUtils.java:33)
at edu.stanford.nlp.sentiment.SentimentUtils.readTreesWithLabels(SentimentUtils.java:69)
at edu.stanford.nlp.sentiment.SentimentUtils.readTreesWithGoldLabels(SentimentUtils.java:50)
at de.dkt.eservices.esentimentanalysis.modules.CoreNLPSentimentAnalyzer.trainModel(CoreNLPSentimentAnalyzer.java:251)
at de.dkt.eservices.esentimentanalysis.modules.CoreNLPSentimentAnalyzer.main(CoreNLPSentimentAnalyzer.java:306)
which makes sense, given that it expects a number, but gets the label of the node in the tree...
Would be grateful for any pointers here!

Haven't found a real solution, but in case someone else runs into this problem, the following did the trick:
public static Tree traverseTreeAndChangePosTagsToNumbers(Tree tree) {
    for (Tree subtree : tree.getChildrenAsList()) {
        // replace non-numeric labels (POS tags etc.) with 2 (neutral)
        if (subtree.label().toString().matches("\\D+")) {
            subtree.label().setValue("2");
        } else if (Integer.parseInt(subtree.label().toString()) < 0
                || Integer.parseInt(subtree.label().toString()) > 4) {
            // clamp out-of-range labels to neutral as well
            subtree.label().setValue("2");
        }
        if (!(subtree.isPreTerminal())) {
            traverseTreeAndChangePosTagsToNumbers(subtree);
        }
    }
    return tree;
}
Not really a decent solution, since it does not acknowledge the option of providing scope for sentiment (i.e. annotating subphrases in the tree; the label for every subphrase is always 2, neutral), so sentiment is always based on the value for the whole sentence/tree. But at least it gets rid of the format error.
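For anyone landing here: once every node carries a numeric label, the trees just need to be written out one per line in the bracketed format shown above, which is what SentimentTraining reads (via its -trainPath option, if I remember correctly). A minimal, untested sketch using plain java.io (PrintWriter/FileWriter), assuming the relabeled trees were collected in the trainingTrees list from the question; the file name is made up:
try (PrintWriter out = new PrintWriter(new FileWriter("train.txt"))) {
    // Tree.toString() yields the Penn-style bracketing, e.g. (3 (2 The) (4 ...))
    for (Tree t : trainingTrees) {
        out.println(t);
    }
}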

Related

Getting the Shortest Path question right in Kotlin

So I got a question that was delivered as a 2D List
val SPE = listOf(
    listOf('w', 'x'),
    listOf('x', 'y'),
    listOf('z', 'y'),
    listOf('z', 'v'),
    listOf('w', 'v')
)
It asks to find the shortest path between w and z. So obviously, BFS would be the best course of action here to find that path the fastest. Here's my code for it:
fun shortestPath(edges: List<List<Char>>, root: Char, destination: Char): Int {
    val graph = buildGraph3(edges)
    val visited = hashSetOf(root)
    val queue = mutableListOf(mutableListOf(root, 0))
    while (queue.size > 0) {
        val node = queue[0].removeFirst()
        val distance = queue[0].removeAt(1)
        if (node == destination) return distance as Int
        graph[node]!!.forEach {
            if (!visited.contains(it)) {
                visited.add(it)
                queue.add(mutableListOf(it, distance + 1))
            }
        }
    }
    queue.sortedByDescending { it.size }
    return queue[0][1]
}

fun buildGraph3(edges: List<List<Char>>): HashMap<Char, MutableList<Char>> {
    val graph = HashMap<Char, MutableList<Char>>()
    for (i in edges.indices) {
        for (n in 0 until edges[i].size) {
            var a = edges[i][0]
            var b = edges[i][1]
            if (!graph.containsKey(a)) { graph[a] = mutableListOf() }
            if (!graph.containsKey(b)) { graph[b] = mutableListOf() }
            graph[a]?.add(b)
            graph[b]?.add(b)
        }
    }
    return graph
}
I am stuck on the return part. I wanted to use a list to keep track of the incrementation of the char, but it won't let me return the number. I could have done this wrong, so any help is appreciated. Thanks.
If I paste your code into an editor I get this warning on your return queue[0][1] statement:
Type mismatch: inferred type is {Comparable<*> & java.io.Serializable} but Int was expected
The problem here is queue contains lists that hold Chars and Int distances, mixed together. You haven't specified the type that list holds, so Kotlin has to infer it from the types of the things you've put in the list. The most general type that covers both is Any?, but the compiler tries to be as specific as it can, inferring the most specific type that covers both Char and Int.
In this case, that's Comparable<*> & java.io.Serializable. So when you pull an item out with queue[0][1], the value you get is a Comparable<*> & java.io.Serializable, not an Int, which is what your function is supposed to be returning.
You can "fix" this by casting - since you know how your list is meant to be organised, two elements with a Char then an Int, you can provide that information to the compiler, since it has no idea what you're doing beyond what it can infer:
val node = queue[0].removeFirst() as Char
val distance = queue[0].removeAt(1) as Int
...
return queue[0][1] as Int
But ideally you'd be using the type system to create some structure around your data, so the compiler knows exactly what everything is. The simplest, most generic one of these is a Pair (or a Triple if you need 3 elements):
val queue = mutableListOf(Pair<Char, Int>(root, 0))
// or if you don't want to explicitly specify the type
val queue = mutableListOf(root to 0)
Now the type system knows that the items in your queue are Pairs where the first element is a Char, and the second is an Int. No need to cast anything, and it will be able to help you as you try to work with that data, and tell you if you're doing the wrong thing.
It might be better to make actual classes that reflect your data, e.g.
data class Step(val node: Char, val distance: Int)
because a Pair is pretty general, but it's up to you. You can pull the data out of it like this:
val node = queue[0].first
val distance = queue[0].second
// or use destructuring to assign the components to multiple variables at once
val (node, distance) = queue[0]
If you make those changes, you'll have to rework some of your algorithm - but you'll have to do that anyway, it's broken in a few ways. I'll just give you some pointers:
your return queue[0][1] line can only be reached when queue is empty
queue[0].removeAt(1) is happening on a list that now has 1 element (i.e. at index 0)
don't you need to remove items from your queue instead?
when building your graph, you call add(b) twice
try printing your graph, the queue at each stage in the loop etc to see what's happening! Make sure it's doing what you expect. Comment out any code that doesn't work so you can make sure the stuff that runs before that is working.
Good luck with it! Hopefully once you get your types sorted out things will start to fall into place more easily

Any Java functions for blocking non-English words?

Please suggest the best Java API for removing non-English words and blocking incorrect words.
I use an English word-list file to parse the given string. The code is responding very slowly.
String englishword;
while ((englishword = br.readLine()) != null) {
    //System.out.println("#" + englishword);
    for (String word : wordsArray) {
        //System.out.println("#" + word);
        if (englishword.trim().toUpperCase().equals(word.trim().toUpperCase())) {
            linetmp = linetmp.replaceAll(word, " ").trim();
            break;
        }
    }
}
if (linetmp != null)
    for (String nonEnglish : linetmp.split("\\s+")) {
        line = line.replaceAll(nonEnglish, "");
    }
line = line.replaceAll(" +", " ");
return line;
Please suggest a faster way to do this, if there is one.
Note: I am using the Linux OS's dictionary list.
Call trim() and toUpperCase() on the checked word only once, outside the for (String word : wordsArray) loop.
If you do expensive operations in the inner loop, no API will help you.
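A minimal sketch of that hoisting, reusing the variables from the question; normalizedWords is a made-up name for a copy of wordsArray that was trimmed and upper-cased once at load time:
// build once, outside the while loop
String[] normalizedWords = new String[wordsArray.length];
for (int i = 0; i < wordsArray.length; i++) {
    normalizedWords[i] = wordsArray[i].trim().toUpperCase();
}
// ... then inside the while loop: normalize the checked word once, not per comparison
String needle = englishword.trim().toUpperCase();
for (int i = 0; i < normalizedWords.length; i++) {
    if (needle.equals(normalizedWords[i])) {
        linetmp = linetmp.replaceAll(wordsArray[i], " ").trim();
        break;
    }
}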
You can use a Java API function for searching
import org.apache.commons.lang.ArrayUtils;
ArrayUtils.indexOf(array, string);
You can make your code a lot faster [1] by changing the wordsArray to a HashSet, and using the contains(String) method to do the checks. (Make sure you convert words to upper case when you build the set.)
However, I would point out that this approach doesn't scale. It is not practical to enumerate all possible "non-English or incorrect" words. You would be better off building a set containing all of the words that you are prepared to accept, and then eliminating the words not in the set.
[1] Currently, your inner loop takes time proportional to the number of words (N) in wordsArray, i.e. O(N). If you use a HashSet, the operation takes O(1) time, i.e. roughly constant time.
There is a faster way.
Create a HashSet<String> containing all the elements of wordsArray (lower-cased or upper-cased, consistently).
For each new word englishword, check if set.contains(englishword.toLowerCase()).
This solution runs in O(n·|S|) pre-processing (creating the HashSet), and checking each word is O(|S|), where |S| is the length of the string and n is the number of words in the array, while your solution is basically O(n·|S|) per word.
Code snippet:
public static class EnglishChecker {
    private final Set<String> set;

    public EnglishChecker(String[] englishWords) {
        set = new HashSet<>();
        for (String s : englishWords) {
            set.add(s.toLowerCase());
        }
    }

    public boolean isWord(String s) {
        return set.contains(s.toLowerCase());
    }
}

public static void main(String[] args) {
    String[] words = { "Cat", "dog", "mousE" };
    EnglishChecker checker = new EnglishChecker(words);
    System.out.println(checker.isWord("cat"));
    System.out.println(checker.isWord("cccccccat"));
    System.out.println(checker.isWord("MOUSE"));
}

Google search suggestion implementation

In a recent Amazon interview I was asked to implement Google's "suggestion" feature: when a user enters "Aeniffer Aninston", Google suggests "Did you mean Jeniffer Aninston?". I tried to solve it by using hashing but could not cover the corner cases. Please let me know your thoughts on this.
The four most common types of errors are:
Omitted letter: "stck" instead of "stack"
One letter typo: "styck" instead of "stack"
Extra letter: "starck" instead of "stack"
Adjacent letters swapped: "satck" instead of "stack"
BTW, we could also swap non-adjacent letters, but that is not a common typo.
The initial state is the typed word. Run BFS/DFS from the initial vertex. The depth of the search is your own choice; remember that increasing it dramatically increases the number of "probable corrections". I think a depth of ~4-5 is a good start.
After generating the "probable corrections", look up each generated word-candidate in a dictionary: binary search in a sorted dictionary, or search in a trie populated with your dictionary.
A trie is faster, but binary search allows searching in a random-access file without loading the dictionary into RAM. You only have to load a precomputed integer array, where array[i] gives the number of bytes to skip to access the i-th word. Words in the random-access file should be written in sorted order. If you have enough RAM to store the dictionary, use a trie.
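A minimal sketch of that on-disk binary search, assuming one word per line in a sorted file and a precomputed offsets array (all names here are illustrative, not a library API):
import java.io.IOException;
import java.io.RandomAccessFile;

// offsets[i] = byte position of the i-th word; only this array lives in RAM.
static boolean containsWord(RandomAccessFile file, long[] offsets, String target) throws IOException {
    int lo = 0, hi = offsets.length - 1;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        file.seek(offsets[mid]);
        String word = file.readLine(); // one word per line, single-byte encoding assumed
        int cmp = word.compareTo(target);
        if (cmp == 0) return true;
        if (cmp < 0) lo = mid + 1; else hi = mid - 1;
    }
    return false;
}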
Before suggesting corrections, check the typed word: if it is already in the dictionary, suggest nothing.
UPDATE
Generating corrections should be done by BFS: when I tried DFS, entries like "Jeniffer" showed "edit distance = 3". DFS doesn't work, since it makes many changes that could be done in one step, for example Jniffer->nJiffer->enJiffer->eJniffer->Jeniffer instead of Jniffer->Jeniffer.
Sample code for generating corrections by BFS:
static class Pair
{
    private String word;
    private byte dist;
    // dist is byte because dist<=128.
    // Moreover, dist<=6 in a real application.

    public Pair(String word, byte dist)
    {
        this.word = word;
        this.dist = dist;
    }

    public String getWord()
    {
        return word;
    }

    public int getDist()
    {
        return dist;
    }
}

public static void main(String[] args) throws Exception
{
    HashSet<String> usedWords = new HashSet<String>();
    HashSet<String> dict = new HashSet<String>();
    ArrayList<String> corrections = new ArrayList<String>();
    ArrayDeque<Pair> states = new ArrayDeque<Pair>();
    // populate dictionary. In real usage it should be populated from a prepared file.
    dict.add("Jeniffer");
    dict.add("Jeniffert"); // depth-2 test
    usedWords.add("Jniffer");
    states.add(new Pair("Jniffer", (byte) 0));
    while (!states.isEmpty())
    {
        Pair head = states.pollFirst();
        //System.out.println(head.getWord() + " " + head.getDist());
        if (head.getDist() <= 2)
        {
            // checking reached depth.
            // 4 is the first depth where we don't generate anything

            // swap adjacent letters
            for (int i = 0; i < head.getWord().length() - 1; i++)
            {
                // swap i-th and (i+1)-th letters
                String newWord = head.getWord().substring(0, i) + head.getWord().charAt(i + 1)
                        + head.getWord().charAt(i) + head.getWord().substring(i + 2);
                // even if i == curWord.length()-2 and then i+2 == curWord.length(),
                // substring(i+2) doesn't throw an exception and returns an empty string;
                // the same for substring(0, i) when i == 0
                if (!usedWords.contains(newWord))
                {
                    usedWords.add(newWord);
                    if (dict.contains(newWord))
                    {
                        corrections.add(newWord);
                    }
                    states.addLast(new Pair(newWord, (byte) (head.getDist() + 1)));
                }
            }
            // insert letters
            for (int i = 0; i <= head.getWord().length(); i++)
                for (char ch = 'a'; ch <= 'z'; ch++)
                {
                    String newWord = head.getWord().substring(0, i) + ch + head.getWord().substring(i);
                    if (!usedWords.contains(newWord))
                    {
                        usedWords.add(newWord);
                        if (dict.contains(newWord))
                        {
                            corrections.add(newWord);
                        }
                        states.addLast(new Pair(newWord, (byte) (head.getDist() + 1)));
                    }
                }
        }
    }
    for (String correction : corrections)
    {
        System.out.println("Did you mean " + correction + "?");
    }
    usedWords.clear();
    corrections.clear();
    // helper data structures must be cleared after each generateCorrections call -
    // they must be empty for future usage.
}
Words in the dictionary: Jeniffer, Jeniffert (Jeniffert is just for testing).
Output:
Did you mean Jeniffer?
Did you mean Jeniffert?
Important!
I chose a generating depth of 2. In a real application the depth should be 4-6, but as the number of combinations grows exponentially, I don't go that deep here. There are optimizations to reduce the number of branches in the search tree, but I haven't thought much about them. I wrote up only the main idea.
Also, I used a HashSet for storing the dictionary and for marking used words. It seems the HashSet's constant factor is too large when it contains millions of objects. Maybe you should use a trie both for dictionary lookups and for checking whether a word has already been visited.
I didn't implement the erase-letter and change-letter operations because I wanted to show only the main idea; a sketch of both is below.
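For reference, an untested sketch of those two operations in the same style, meant to sit next to the "swap" and "insert" loops above and using the same head, usedWords, dict, corrections and states variables:
// erase letters
for (int i = 0; i < head.getWord().length(); i++)
{
    // drop the i-th letter
    String newWord = head.getWord().substring(0, i) + head.getWord().substring(i + 1);
    if (!usedWords.contains(newWord))
    {
        usedWords.add(newWord);
        if (dict.contains(newWord))
        {
            corrections.add(newWord);
        }
        states.addLast(new Pair(newWord, (byte) (head.getDist() + 1)));
    }
}
// change letters
for (int i = 0; i < head.getWord().length(); i++)
    for (char ch = 'a'; ch <= 'z'; ch++)
    {
        // replace the i-th letter with ch
        String newWord = head.getWord().substring(0, i) + ch + head.getWord().substring(i + 1);
        if (!usedWords.contains(newWord))
        {
            usedWords.add(newWord);
            if (dict.contains(newWord))
            {
                corrections.add(newWord);
            }
            states.addLast(new Pair(newWord, (byte) (head.getDist() + 1)));
        }
    }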

How to convert Chinese characters to Pinyin [closed]

For sorting Chinese language text, I want to convert Chinese characters to Pinyin, properly separating each Chinese character and grouping successive characters together.
Can you please help me with this task by providing the logic or source code for doing it?
Please let me know if an open-source project or library for this already exists.
Short answer: you don't.
Long answer: There is no one-to-one mapping for 汉字 to 汉语拼音. Just some quick examples:
把 can be "ba" in the third tone or fourth tone.
了 can be "le" toneless or "liao" third tone.
乐 can be "le" or "yue", both in the fourth tone.
落 can be "luo", "la" or "lao", all in the fourth tone.
And so on. I have a beginners' book on this topic that has 207 examples. I stress that this is a beginners' book and is by no means complete. Each one has a page or two of examples of use and conditions under which you choose the appropriate pronunciation. It is not something that could be easily programmed (if at all).
And this doesn't even address the other slippery thing you want to deal with: the separation of characters into grouped words. The very notion of a word is a bit slippery in Chinese. (There are two terms that correspond, roughly, to "word" in Chinese: 字 and 词. The first is the character; the second is a group of characters that are put together into one concept. I frequently get asked by Chinese speakers how many "words" I can read when they really mean "characters".) While in some cases the distinction is clear (the 词 "乌鸦", for example, is "crow" -- the two 字 must be together to express the idea properly and it would be incorrect to translate it as "black crow"), in others it is not so clear. What does "你好" translate to? Is it one word meaning, idiomatically, "hello"? Or is it two words translating literally to "you good"? Each of the characters involved stands alone or in groups with other words, but together they mean something entirely different from their individual meanings. Given this, how, precisely, do you plan to group the 汉语拼音 transliterations (which are difficult to impossible to get right in the first place!) into "words"?
While @JUST MY correct OPINION's answer addresses some of the difficulties of converting characters into pinyin, it is not an impossible problem to solve.
I have written a library (pinyinify) that solves this task with decent accuracy. Even though there is not a one-to-one mapping between characters and pinyin, my library can usually decide which pronunciation is correct. For example, "我受不了了" correctly converts to "wǒ shòubùliǎo le", with two different pronunciations of 了.
My approach to solving the problem is pretty simple:
First segment the text into words. For example, 我喜欢旅游 would be divided into three words: 我 喜欢 旅游. This is also not a simple process, but there are many libraries for it. jieba is one of the more popular libraries for this purpose.
Use a dictionary to convert the words into pinyin.
If the word is not in the dictionary, fall back to converting the individual characters to pinyin using their most common pronunciation.
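To illustrate steps 2 and 3, here is a rough Java sketch; segment() is a stand-in for a real segmenter such as jieba, and the two dictionaries are toy placeholders populated inline instead of from real data files:
import java.util.HashMap;
import java.util.Map;

// Toy word-level and character-level dictionaries.
Map<String, String> wordDict = new HashMap<>();
wordDict.put("喜欢", "xǐhuan");
wordDict.put("旅游", "lǚyóu");
Map<Character, String> charDict = new HashMap<>();
charDict.put('我', "wǒ");

StringBuilder pinyin = new StringBuilder();
for (String word : segment("我喜欢旅游")) { // hypothetical segmenter
    String p = wordDict.get(word);
    if (p != null) {
        pinyin.append(p);                   // step 2: whole-word lookup
    } else {
        for (char c : word.toCharArray()) { // step 3: per-character fallback
            pinyin.append(charDict.getOrDefault(c, String.valueOf(c)));
        }
    }
    pinyin.append(' ');
}
// pinyin.toString().trim() -> "wǒ xǐhuan lǚyóu"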
CoreFoundation provides a method to do the conversion:
CFMutableStringRef string = CFStringCreateMutableCopy(NULL, 0, CFSTR("中文"));
CFStringTransform(string, NULL, kCFStringTransformMandarinLatin, NO);
CFStringTransform(string, NULL, kCFStringTransformStripDiacritics, NO);
NSLog(#"%#", string);
The output is
zhong wen
The following code, written in C#, can help you convert Chinese words covered by the GB2312 encoding (just the commonly used Simplified Chinese characters) to pinyin, e.g. converting "今天天气不错" to "JinTianTianQiBuCuo".
Sometimes a Chinese character does not map one-to-one to a pinyin reading; it depends on the context. For example, the 行 in 自行车 (bike) is pronounced "Xing", but in 银行 (bank) it is pronounced "Hang". If this is a problem for you, you may need a more complex solution to handle it.
Sorry for my poor English. I hope this helps a little.
public class ChineseToPinYin
{
private static int[] pyValue = new int[]
{
-20319,-20317,-20304,-20295,-20292,-20283,-20265,-20257,-20242,-20230,-20051,-20036,
-20032,-20026,-20002,-19990,-19986,-19982,-19976,-19805,-19784,-19775,-19774,-19763,
-19756,-19751,-19746,-19741,-19739,-19728,-19725,-19715,-19540,-19531,-19525,-19515,
-19500,-19484,-19479,-19467,-19289,-19288,-19281,-19275,-19270,-19263,-19261,-19249,
-19243,-19242,-19238,-19235,-19227,-19224,-19218,-19212,-19038,-19023,-19018,-19006,
-19003,-18996,-18977,-18961,-18952,-18783,-18774,-18773,-18763,-18756,-18741,-18735,
-18731,-18722,-18710,-18697,-18696,-18526,-18518,-18501,-18490,-18478,-18463,-18448,
-18447,-18446,-18239,-18237,-18231,-18220,-18211,-18201,-18184,-18183, -18181,-18012,
-17997,-17988,-17970,-17964,-17961,-17950,-17947,-17931,-17928,-17922,-17759,-17752,
-17733,-17730,-17721,-17703,-17701,-17697,-17692,-17683,-17676,-17496,-17487,-17482,
-17468,-17454,-17433,-17427,-17417,-17202,-17185,-16983,-16970,-16942,-16915,-16733,
-16708,-16706,-16689,-16664,-16657,-16647,-16474,-16470,-16465,-16459,-16452,-16448,
-16433,-16429,-16427,-16423,-16419,-16412,-16407,-16403,-16401,-16393,-16220,-16216,
-16212,-16205,-16202,-16187,-16180,-16171,-16169,-16158,-16155,-15959,-15958,-15944,
-15933,-15920,-15915,-15903,-15889,-15878,-15707,-15701,-15681,-15667,-15661,-15659,
-15652,-15640,-15631,-15625,-15454,-15448,-15436,-15435,-15419,-15416,-15408,-15394,
-15385,-15377,-15375,-15369,-15363,-15362,-15183,-15180,-15165,-15158,-15153,-15150,
-15149,-15144,-15143,-15141,-15140,-15139,-15128,-15121,-15119,-15117,-15110,-15109,
-14941,-14937,-14933,-14930,-14929,-14928,-14926,-14922,-14921,-14914,-14908,-14902,
-14894,-14889,-14882,-14873,-14871,-14857,-14678,-14674,-14670,-14668,-14663,-14654,
-14645,-14630,-14594,-14429,-14407,-14399,-14384,-14379,-14368,-14355,-14353,-14345,
-14170,-14159,-14151,-14149,-14145,-14140,-14137,-14135,-14125,-14123,-14122,-14112,
-14109,-14099,-14097,-14094,-14092,-14090,-14087,-14083,-13917,-13914,-13910,-13907,
-13906,-13905,-13896,-13894,-13878,-13870,-13859,-13847,-13831,-13658,-13611,-13601,
-13406,-13404,-13400,-13398,-13395,-13391,-13387,-13383,-13367,-13359,-13356,-13343,
-13340,-13329,-13326,-13318,-13147,-13138,-13120,-13107,-13096,-13095,-13091,-13076,
-13068,-13063,-13060,-12888,-12875,-12871,-12860,-12858,-12852,-12849,-12838,-12831,
-12829,-12812,-12802,-12607,-12597,-12594,-12585,-12556,-12359,-12346,-12320,-12300,
-12120,-12099,-12089,-12074,-12067,-12058,-12039,-11867,-11861,-11847,-11831,-11798,
-11781,-11604,-11589,-11536,-11358,-11340,-11339,-11324,-11303,-11097,-11077,-11067,
-11055,-11052,-11045,-11041,-11038,-11024,-11020,-11019,-11018,-11014,-10838,-10832,
-10815,-10800,-10790,-10780,-10764,-10587,-10544,-10533,-10519,-10331,-10329,-10328,
-10322,-10315,-10309,-10307,-10296,-10281,-10274,-10270,-10262,-10260,-10256,-10254
};
private static string[] pyName = new string[]
{
"A","Ai","An","Ang","Ao","Ba","Bai","Ban","Bang","Bao","Bei","Ben",
"Beng","Bi","Bian","Biao","Bie","Bin","Bing","Bo","Bu","Ba","Cai","Can",
"Cang","Cao","Ce","Ceng","Cha","Chai","Chan","Chang","Chao","Che","Chen","Cheng",
"Chi","Chong","Chou","Chu","Chuai","Chuan","Chuang","Chui","Chun","Chuo","Ci","Cong",
"Cou","Cu","Cuan","Cui","Cun","Cuo","Da","Dai","Dan","Dang","Dao","De",
"Deng","Di","Dian","Diao","Die","Ding","Diu","Dong","Dou","Du","Duan","Dui",
"Dun","Duo","E","En","Er","Fa","Fan","Fang","Fei","Fen","Feng","Fo",
"Fou","Fu","Ga","Gai","Gan","Gang","Gao","Ge","Gei","Gen","Geng","Gong",
"Gou","Gu","Gua","Guai","Guan","Guang","Gui","Gun","Guo","Ha","Hai","Han",
"Hang","Hao","He","Hei","Hen","Heng","Hong","Hou","Hu","Hua","Huai","Huan",
"Huang","Hui","Hun","Huo","Ji","Jia","Jian","Jiang","Jiao","Jie","Jin","Jing",
"Jiong","Jiu","Ju","Juan","Jue","Jun","Ka","Kai","Kan","Kang","Kao","Ke",
"Ken","Keng","Kong","Kou","Ku","Kua","Kuai","Kuan","Kuang","Kui","Kun","Kuo",
"La","Lai","Lan","Lang","Lao","Le","Lei","Leng","Li","Lia","Lian","Liang",
"Liao","Lie","Lin","Ling","Liu","Long","Lou","Lu","Lv","Luan","Lue","Lun",
"Luo","Ma","Mai","Man","Mang","Mao","Me","Mei","Men","Meng","Mi","Mian",
"Miao","Mie","Min","Ming","Miu","Mo","Mou","Mu","Na","Nai","Nan","Nang",
"Nao","Ne","Nei","Nen","Neng","Ni","Nian","Niang","Niao","Nie","Nin","Ning",
"Niu","Nong","Nu","Nv","Nuan","Nue","Nuo","O","Ou","Pa","Pai","Pan",
"Pang","Pao","Pei","Pen","Peng","Pi","Pian","Piao","Pie","Pin","Ping","Po",
"Pu","Qi","Qia","Qian","Qiang","Qiao","Qie","Qin","Qing","Qiong","Qiu","Qu",
"Quan","Que","Qun","Ran","Rang","Rao","Re","Ren","Reng","Ri","Rong","Rou",
"Ru","Ruan","Rui","Run","Ruo","Sa","Sai","San","Sang","Sao","Se","Sen",
"Seng","Sha","Shai","Shan","Shang","Shao","She","Shen","Sheng","Shi","Shou","Shu",
"Shua","Shuai","Shuan","Shuang","Shui","Shun","Shuo","Si","Song","Sou","Su","Suan",
"Sui","Sun","Suo","Ta","Tai","Tan","Tang","Tao","Te","Teng","Ti","Tian",
"Tiao","Tie","Ting","Tong","Tou","Tu","Tuan","Tui","Tun","Tuo","Wa","Wai",
"Wan","Wang","Wei","Wen","Weng","Wo","Wu","Xi","Xia","Xian","Xiang","Xiao",
"Xie","Xin","Xing","Xiong","Xiu","Xu","Xuan","Xue","Xun","Ya","Yan","Yang",
"Yao","Ye","Yi","Yin","Ying","Yo","Yong","You","Yu","Yuan","Yue","Yun",
"Za", "Zai","Zan","Zang","Zao","Ze","Zei","Zen","Zeng","Zha","Zhai","Zhan",
"Zhang","Zhao","Zhe","Zhen","Zheng","Zhi","Zhong","Zhou","Zhu","Zhua","Zhuai","Zhuan",
"Zhuang","Zhui","Zhun","Zhuo","Zi","Zong","Zou","Zu","Zuan","Zui","Zun","Zuo"
};
/// <summary>
/// Convert Chinese characters to pinyin (full spelling)
/// </summary>
/// <param name="hzString">the Chinese character string</param>
/// <returns>the converted pinyin (full spelling) string</returns>
public static string Convert(string hzString)
{
// match Chinese characters
Regex regex = new Regex("^[\u4e00-\u9fa5]$");
byte[] array = new byte[2];
string pyString = "";
int chrAsc = 0;
int i1 = 0;
int i2 = 0;
char[] noWChar = hzString.ToCharArray();
for (int j = 0; j < noWChar.Length; j++)
{
// Chinese character
if (regex.IsMatch(noWChar[j].ToString()))
{
array = System.Text.Encoding.Default.GetBytes(noWChar[j].ToString());
i1 = (short)(array[0]);
i2 = (short)(array[1]);
chrAsc = i1 * 256 + i2 - 65536;
if (chrAsc > 0 && chrAsc < 160)
{
pyString += noWChar[j];
}
else
{
// fix some characters
if (chrAsc == -9254) // fix the character 圳
pyString += "Zhen";
else
{
for (int i = (pyValue.Length - 1); i >= 0; i--)
{
if (pyValue[i] <= chrAsc)
{
pyString += pyName[i];
break;
}
}
}
}
}
// non-Chinese character
else
{
pyString += noWChar[j].ToString();
}
}
return pyString;
}
}
In Python, you can use the pypinyin library:
from __future__ import unicode_literals
from pypinyin import lazy_pinyin
hanzi_list = ['如何', '将', '汉字','转为', '拼音']
pinyin_list = [''.join(lazy_pinyin(_)) for _ in hanzi_list]
Output:
['ruhe', 'jiang', 'hanzi', 'zhuanwei', 'pinyin']
I had this problem and I found a solution in PHP (which could be cleaner, I suppose, but it works). I had some trouble because the file given in this topic uses hex Unicode code points.
1) Import the data from ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/data/Uni2Pinyin.gz (thanks pierr) to your database or whatever
2) Import your data in an array as $pinyinArray[$hexaUnicode] = $pinyin;
3) Use this code:
/*
 * Decimal representation of $c
 * function found there: http://www.cantonese.sheik.co.uk/phorum/read.php?2,19594
 */
function uniord($c)
{
    $ud = 0;
    if (ord($c[0]) >= 0 && ord($c[0]) <= 127)
        $ud = ord($c[0]);
    if (ord($c[0]) >= 192 && ord($c[0]) <= 223)
        $ud = (ord($c[0]) - 192) * 64 + (ord($c[1]) - 128);
    if (ord($c[0]) >= 224 && ord($c[0]) <= 239)
        $ud = (ord($c[0]) - 224) * 4096 + (ord($c[1]) - 128) * 64 + (ord($c[2]) - 128);
    if (ord($c[0]) >= 240 && ord($c[0]) <= 247)
        $ud = (ord($c[0]) - 240) * 262144 + (ord($c[1]) - 128) * 4096 + (ord($c[2]) - 128) * 64 + (ord($c[3]) - 128);
    if (ord($c[0]) >= 248 && ord($c[0]) <= 251)
        $ud = (ord($c[0]) - 248) * 16777216 + (ord($c[1]) - 128) * 262144 + (ord($c[2]) - 128) * 4096 + (ord($c[3]) - 128) * 64 + (ord($c[4]) - 128);
    if (ord($c[0]) >= 252 && ord($c[0]) <= 253)
        $ud = (ord($c[0]) - 252) * 1073741824 + (ord($c[1]) - 128) * 16777216 + (ord($c[2]) - 128) * 262144 + (ord($c[3]) - 128) * 4096 + (ord($c[4]) - 128) * 64 + (ord($c[5]) - 128);
    if (ord($c[0]) >= 254 && ord($c[0]) <= 255) // error
        $ud = false;
    return $ud;
}

/*
 * Translate the string $string of a single Chinese character to Unicode
 */
function chineseToHexaUnicode($string)
{
    return strtoupper(dechex(uniord($string)));
}

function convertChineseToPinyin($string, $pinyinArray)
{
    $pinyinValue = '';
    for ($i = 0; $i < mb_strlen($string); $i++)
        $pinyinValue .= $pinyinArray[chineseToHexaUnicode(mb_substr($string, $i, 1))];
    return $pinyinValue;
}

$string = '龙江省五大';
echo convertChineseToPinyin($string, $pinyinArray);
This echoes: (long2)(jiang1)(sheng3,xing3)(wu3)(da4,dai4)
Of course, $pinyinArray is your array of data (hexa Unicode => pinyin).
Hope it will help someone.
If you use Visual Studio, this might be an option:
Microsoft.International.Converters.PinYinConverter
How to install:
First, download the Visual Studio International Feature Pack 2.0 (official download). Once the download is complete, run the VSIPSetup.msi installer (on an x86 operating system the default installation directory is C:\Program Files\Microsoft Visual Studio International Feature Pack 2.0).
After installation, you need to add the references in VS:
C:\Program Files\Microsoft Visual Studio International Pack\Simplified Chinese Pin-Yin Conversion Library (Pinyin)
and
C:\Program Files\Microsoft Visual Studio International Pack\Traditional Chinese to Simplified Chinese Conversion Library and Add-In Tool
How to use:
public static string GetPinyin(string str)
{
    string r = string.Empty;
    foreach (char obj in str)
    {
        try
        {
            ChineseChar chineseChar = new ChineseChar(obj);
            string t = chineseChar.Pinyins[0].ToString();
            r += t.Substring(0, t.Length - 1);
        }
        catch
        {
            r += obj.ToString();
        }
    }
    return r;
}
Source:
http://www.programering.com/a/MzM3cTMwATA.html

Word distribution problem

I have a big file of words (~100 GB) and limited memory (4 GB). I need to calculate the word distribution from this file. One option is to divide it into chunks, sort each chunk, and then merge them to calculate the word distribution. Is there any other, faster way to do it? One idea is to sample, but I am not sure how to implement it so that it returns a solution close to the correct one.
Thanks
You can build a Trie structure where each leaf (and some inner nodes) contains the current count. As words share prefixes with each other, 4 GB should be enough to process 100 GB of data.
Naively I would just build up a hash table until it hits a certain limit in memory, then sort it in memory and write it out. Finally, you can do an n-way merge of the chunks. At most you will have 100/4 = 25 chunks or so, but probably many fewer, provided some words are more common than others (and depending on how they cluster). A sketch of this is below.
Another option is to use a trie, which was built for this kind of thing: each character in the string becomes a branch in a 256-way tree, and at the leaf you have the counter. Look up the data structure on the web.
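A rough Java sketch of the hash-table-and-spill approach (the spill threshold and file names are made up; the final n-way merge of the sorted chunk files, summing counts for equal words, is left as the last step):
import java.io.*;
import java.util.*;

public class WordCounter {
    private int chunkId = 0;

    // Count words in memory; when the map gets too big, spill a sorted
    // "word<TAB>count" chunk to disk and start over.
    public void countAndSpill(BufferedReader in) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        String word;
        while ((word = in.readLine()) != null) {
            counts.merge(word, 1L, Long::sum);
            if (counts.size() >= 10_000_000) { // crude stand-in for a memory limit
                spill(counts);
                counts.clear();
            }
        }
        if (!counts.isEmpty()) spill(counts);
    }

    private void spill(Map<String, Long> counts) throws IOException {
        List<String> words = new ArrayList<>(counts.keySet());
        Collections.sort(words); // sorted chunks make the final merge a linear scan
        try (PrintWriter out = new PrintWriter(new FileWriter("chunk" + chunkId++ + ".txt"))) {
            for (String w : words) {
                out.println(w + "\t" + counts.get(w));
            }
        }
    }
}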
If you can pardon the pun, "trie" this:
public class Trie : Dictionary<char, Trie>
{
    public int Frequency { get; set; }

    public void Add(string word)
    {
        this.Add(word.ToCharArray());
    }

    private void Add(char[] chars)
    {
        if (chars == null || chars.Length == 0)
        {
            throw new System.ArgumentException();
        }
        var first = chars[0];
        if (!this.ContainsKey(first))
        {
            this.Add(first, new Trie());
        }
        if (chars.Length == 1)
        {
            this[first].Frequency += 1;
        }
        else
        {
            this[first].Add(chars.Skip(1).ToArray());
        }
    }

    public int GetFrequency(string word)
    {
        return this.GetFrequency(word.ToCharArray());
    }

    private int GetFrequency(char[] chars)
    {
        if (chars == null || chars.Length == 0)
        {
            throw new System.ArgumentException();
        }
        var first = chars[0];
        if (!this.ContainsKey(first))
        {
            return 0;
        }
        if (chars.Length == 1)
        {
            return this[first].Frequency;
        }
        else
        {
            return this[first].GetFrequency(chars.Skip(1).ToArray());
        }
    }
}
Then you can call code like this:
var t = new Trie();
t.Add("Apple");
t.Add("Banana");
t.Add("Cherry");
t.Add("Banana");
var a = t.GetFrequency("Apple"); // == 1
var b = t.GetFrequency("Banana"); // == 2
var c = t.GetFrequency("Cherry"); // == 1
You should be able to add code to traverse the trie and return a flat list of words and their frequencies.
If you find that this too still blows your memory limit then might I suggest that you "divide and conquer". Maybe scan the source data for all the first characters and then run the trie separately against each and then concatenate the results after all of the runs.
Do you know how many different words you have? If not a lot (i.e., a few hundred thousand), then you can stream the input, determine the words, and use a hash table to keep the counts. After the input is done, just traverse the result.
Just use a DBM file. It’s a hash on disk. If you use the more recent versions, you can use a B+Tree to get in-order traversal.
Why not use a relational DB? The procedure would be as simple as the following (a JDBC sketch is below):
Create a table with the word and its count.
Create an index on word. Some databases have a word index (e.g., Progress).
Do a SELECT on this table for the word.
If the word exists, then increase its counter.
Otherwise, add it to the table.
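A minimal JDBC sketch of that loop (SQLite upsert syntax assumed; the table, connection string, and reader are illustrative):
import java.sql.*;

// Assumes a table created as: CREATE TABLE counts(word TEXT PRIMARY KEY, n INTEGER)
try (Connection conn = DriverManager.getConnection("jdbc:sqlite:words.db");
     PreparedStatement upsert = conn.prepareStatement(
             "INSERT INTO counts(word, n) VALUES (?, 1) " +
             "ON CONFLICT(word) DO UPDATE SET n = n + 1")) {
    String word;
    while ((word = reader.readLine()) != null) { // reader streams the big word file
        upsert.setString(1, word);
        upsert.executeUpdate();
    }
}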
If you are using Python, you can do this with a generator: it will read line by line from your file and will not cause memory problems. You should not "return" the value but "yield" it.
Here is a sample that I used to read a file and get the vector values.
def __iter__(self):
    for line in open(self.temp_file_name):
        yield self.dictionary.doc2bow(line.lower().split())
