Using stanford parser to parse Chinese - stanford-nlp

Here is my code, mostly from the demo. The program runs without errors, but the result is wrong: it did not split the words.
Thank you
public static void main(String[] args) {
LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz");
demoAPI(lp);
}
public static void demoAPI(LexicalizedParser lp) {
// This option shows loading and using an explicit tokenizer
String sent2 = "我爱你";
TokenizerFactory<CoreLabel> tokenizerFactory =
PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tok =
tokenizerFactory.getTokenizer(new StringReader(sent2));
List<CoreLabel> rawWords2 = tok.tokenize();
Tree parse = lp.apply(rawWords2);
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
System.out.println(tdl);
System.out.println();
// You can also use a TreePrint object to print trees and dependencies
TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
tp.printTree(parse);
}

Did you make sure to segment the words? For example, try running it again with "我 爱 你." as the sentence. I believe the parser segments automatically when run from the command line, but I'm not sure what it does when called from Java.
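To see why pre-segmenting matters, here is a minimal sketch that uses only the Java standard library. The whitespace splitter is a hypothetical stand-in for a tokenizer that has no knowledge of Chinese word boundaries; it is not the Stanford API. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class SegmentationSketch {
    // Splits on runs of whitespace. This stands in for a tokenizer that
    // cannot find Chinese word boundaries by itself (an assumption).
    static List<String> whitespaceTokens(String sentence) {
        return Arrays.asList(sentence.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Unsegmented input stays one token, so a parser sees a single "word"
        System.out.println(whitespaceTokens("我爱你"));
        // Pre-segmented input yields one token per word
        System.out.println(whitespaceTokens("我 爱 你"));
    }
}
```

With the segmented sentence the parser would receive three words instead of one, which is what the suggestion above relies on.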

Generate Pattern Matching with Declaration using Roslyn SyntaxGeneration

I am writing a custom Code Refactoring to transform a variable declaration into a pattern matching expression including declaration. This works great but I cannot get the declaration part to work.
I succeed in transforming MyOwnClass myClass = GetCustomClass(); into:
if(GetCustomClass() is MyOwnClass)
{
}
I fail to transform MyOwnClass myClass = GetCustomClass(); into:
if(GetCustomClass() is MyOwnClass myClass)
{
}
The code I have:
Excerpt:
var generator = SyntaxGenerator.GetGenerator(document);
SyntaxNode isTypeExpression = generator.IsTypeExpression(method, type);
SyntaxNode ifClause = generator.IfStatement(isTypeExpression, new List<SyntaxNode>(), new List<SyntaxNode>());
editor.ReplaceNode(localDeclartionSyntax, ifClause);
Full code:
private async Task<Document> MakePatternMatchingClause(Document document, LocalDeclarationStatementSyntax localDeclartionSyntax, CancellationToken c)
{
if (document.TryGetSyntaxRoot(out SyntaxNode root))
{
var editor = new SyntaxEditor(root, document.Project.Solution.Workspace);
var declaration = localDeclartionSyntax.Declaration;
var variableDeclarationSyntax = localDeclartionSyntax.Declaration.Variables.FirstOrDefault();
TypeSyntax type = localDeclartionSyntax.Declaration.Type;
var method = variableDeclarationSyntax.Initializer.Value;
SyntaxToken identifier = variableDeclarationSyntax.Identifier;
var generator = SyntaxGenerator.GetGenerator(document);
SyntaxNode isTypeExpression = generator.IsTypeExpression(method, type);
SyntaxNode ifClause = generator.IfStatement(isTypeExpression, new List<SyntaxNode>(), new List<SyntaxNode>());
editor.ReplaceNode(localDeclartionSyntax, ifClause);
return document.WithSyntaxRoot(editor.GetChangedRoot());
}
return document;
}
With a fresh head in the morning I found the solution. One needs to use the older SyntaxFactory class (which I had found before, but I searched for the wrong keywords):
SingleVariableDesignationSyntax singleVariableDesignation = SyntaxFactory.SingleVariableDesignation(identifier);
DeclarationPatternSyntax singleVariableDeclaration = SyntaxFactory.DeclarationPattern(type, singleVariableDesignation);
IsPatternExpressionSyntax isPatternDeclaration = SyntaxFactory.IsPatternExpression(method, singleVariableDeclaration);

How to measure string interning?

I'm trying to measure the impact of string interning in an application.
I came up with this:
class Program
{
static void Main(string[] args)
{
_ = BenchmarkRunner.Run<Benchmark>();
}
}
[MemoryDiagnoser]
public class Benchmark
{
[Params(10000, 100000, 1000000)]
public int Count { get; set; }
[Benchmark]
public string[] NotInterned()
{
var a = new string[this.Count];
for (var i = this.Count; i-- > 0;)
{
a[i] = GetString(i);
}
return a;
}
[Benchmark]
public string[] Interned()
{
var a = new string[this.Count];
for (var i = this.Count; i-- > 0;)
{
a[i] = string.Intern(GetString(i));
}
return a;
}
private static string GetString(int i)
{
var result = (i % 10).ToString();
return result;
}
}
But I always end up with the same amount of allocated memory.
Is there any other measure or diagnostic that gives me the memory savings of using string.Intern()?
The main question here is what kind of impact do you want to measure? To be more specific: what are your target metrics? Here are some examples: performance metrics, memory traffic, memory footprint.
In the BenchmarkDotNet Allocated column, you get the memory traffic. string.Intern doesn't help to optimize it in your example, because each (i % 10).ToString() call will allocate a new string. Thus, it's expected that BenchmarkDotNet shows the same numbers in the Allocated column.
However, string.Intern should help you to optimize the memory footprint of your application at the end (the total managed heap size, which can be fetched via GC.GetTotalMemory()). This can be verified with a simple console application without BenchmarkDotNet:
using System;
namespace ConsoleApp24
{
class Program
{
private const int Count = 100000;
private static string[] notInterned, interned;
static void Main(string[] args)
{
var memory1 = GC.GetTotalMemory(true);
notInterned = NotInterned();
var memory2 = GC.GetTotalMemory(true);
interned = Interned();
var memory3 = GC.GetTotalMemory(true);
Console.WriteLine(memory2 - memory1);
Console.WriteLine(memory3 - memory2);
Console.WriteLine((memory2 - memory1) - (memory3 - memory2));
}
public static string[] NotInterned()
{
var a = new string[Count];
for (var i = Count; i-- > 0;)
{
a[i] = GetString(i);
}
return a;
}
public static string[] Interned()
{
var a = new string[Count];
for (var i = Count; i-- > 0;)
{
a[i] = string.Intern(GetString(i));
}
return a;
}
private static string GetString(int i)
{
var result = (i % 10).ToString();
return result;
}
}
}
On my machine (Linux, .NET Core 3.1), I got the following results:
802408
800024
2384
The first number and the second number are the memory footprint impacts for both cases. It's pretty huge because the string array consumes a lot of memory to keep the references to all the string instances.
The third number is the difference between the footprint impacts of the not-interned and the interned strings. You may ask why it's so small. This can be easily explained: Stephen Toub implemented a special cache for single-digit strings in dotnet/coreclr#18383, which he describes in his blog post.
So, it doesn't make sense to measure interning of the "0".."9" strings on .NET Core. We can easily modify our program to fix this problem:
private static string GetString(int i)
{
var result = "x" + (i % 10).ToString();
return result;
}
Here are the updated results:
4002432
800344
3202088
Now the impact difference (the third number) is pretty huge (3202088). It means that interning helped us to save 3202088 bytes in the managed heap.
So, here are the most important recommendations for your future experiments:
Carefully define the metrics that you actually want to measure. Don't say "I want to find all kinds of affected metrics": any change in the source code may affect hundreds of different metrics, and it's pretty hard to measure all of them in each experiment. Carefully think about which metrics are really important for you.
Try to use input data that is close to your actual work scenarios. Benchmarking with some "dummy" data may lead to incorrect results because there are too many tricky optimizations in the runtime that work pretty well with such "dummy" cases.
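The difference between memory traffic and memory footprint can also be sketched outside .NET. The following is a Java analogue, not the .NET API: it uses Java's String.intern() and counts distinct string objects by identity instead of reading heap sizes. The class and method names are made up for illustration.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class InternFootprintSketch {
    // Built at run time, so every call creates a fresh String object
    static String getString(int i) {
        return "x" + (i % 10);
    }

    // Fills an array with the generated strings, optionally interning them
    static String[] fill(int count, boolean intern) {
        String[] a = new String[count];
        for (int i = 0; i < count; i++) {
            String s = getString(i);
            a[i] = intern ? s.intern() : s;
        }
        return a;
    }

    // Counts distinct String objects (reference identity, not equals)
    static int distinctInstances(String[] values) {
        Set<String> seen =
            Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        Collections.addAll(seen, values);
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctInstances(fill(100_000, false))); // prints 100000
        System.out.println(distinctInstances(fill(100_000, true)));  // prints 10
    }
}
```

Both variants create the same number of temporary strings (the traffic), but only ten objects stay reachable from the interned array (the footprint).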

while(line.contains("^")) loop not breaking

This is my class:
import java.io.*;
public class Test
{
public static void main(String[] args) throws FileNotFoundException, IOException
{
BufferedReader br = new BufferedReader(new FileReader("file2.txt"));
BufferedWriter bw = new BufferedWriter(new FileWriter("file.txt"));
int i = 0;
String line;
while ((line = br.readLine()) != null) {
while(line.contains("^")) {
i ++;
line = line.replaceFirst("^", Integer.toString(i));
}
bw.write(line + "\n");
}
br.close();
bw.close();
}
}
file2.txt and file.txt are exactly the same, and I want to make the lines that look like
<wpt lat="26.381418638" lon="-80.101236298"><ele>0</ele><time> </time><name>Waypoint #^</name><desc> </desc></wpt>
to look like
<wpt lat="26.381418638" lon="-80.101236298"><ele>0</ele><time> </time><name>Waypoint #5</name><desc> </desc></wpt>
When I run it, though, it goes into an infinite loop. Any advice will help. Thanks!
line = line.replaceFirst("^", Integer.toString(i));
replaceFirst's first argument is a regular expression, and "^" as a regular expression means "the start of the string". So this command just keeps prepending values to the start of the string, and never removes any circumflexes. Instead, you should write:
line = line.replaceFirst("\\^", Integer.toString(i));
The String.replaceFirst method takes a regular expression, in which certain characters have special meanings; the ^ character is one of them. You need to escape it to look for occurrences of the literal character. (In Java, since the backslash is itself special in string literals, this is written as "\\^" in the replaceFirst argument.)
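A small sketch of the difference, using only the standard library. The class and method names are made up for illustration, and Pattern.quote is shown as an alternative to writing the backslash escape by hand.

```java
import java.util.regex.Pattern;

class CaretReplaceSketch {
    // Replaces each literal '^' with the next counter value, starting at
    // start + 1, mirroring the loop in the question.
    static String numberCarets(String line, int start) {
        int i = start;
        while (line.contains("^")) {
            i++;
            // Pattern.quote turns "^" into a regex that matches the literal character
            line = line.replaceFirst(Pattern.quote("^"), Integer.toString(i));
        }
        return line;
    }

    public static void main(String[] args) {
        // Unescaped: "^" matches the empty string at the start, so "5" is prepended
        System.out.println("#^".replaceFirst("^", "5"));    // prints 5#^
        // Escaped: the literal caret itself is replaced
        System.out.println("#^".replaceFirst("\\^", "5"));  // prints #5
        System.out.println(numberCarets("Waypoint #^", 4)); // prints Waypoint #5
    }
}
```

If no regular expression is needed at all, String.replace("^", "5") treats its first argument as plain text, but it replaces every occurrence rather than only the first.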

HtmlAgility Pack get single node get null value

I am trying to get a single node with an XPath, but I am getting a null value for the node and don't know why.
WebClient wc = new WebClient();
string nodeValue;
string htmlCode = wc.DownloadString("http://www.freeproxylists.net/fr/?c=&pt=&pr=&a%5B%5D=0&a%5B%5D=1&a%5B%5D=2&u=50");
HtmlAgilityPack.HtmlDocument html = new HtmlAgilityPack.HtmlDocument();
html.LoadHtml(htmlCode);
HtmlNode node = html.DocumentNode.SelectSingleNode("//table[@class='DataGrid']/tbody/tr[@class='Odd']/td/a");
nodeValue = (node.InnerHtml);
I see at least two mistakes in your XPath compared to the HTML you're trying to get information from.
There is no <a> that has <tr class="Odd"> as an ancestor.
Even if your XPath had worked, you would only have gotten one node, since you chose SelectSingleNode instead of SelectNodes.
It looks like they are doing some kind of lazy protection against what you're trying to do: the a-tag is just represented in hexadecimal, enclosed in IPDecode, so it is really no problem to extract the link. But the least you could have done was to look at the HTML before posting. You clearly have not tried at all, since the HTML you're getting from your current code is not the <body> of the link you gave us, meaning you have to get the HTML page from the absolute URL or just use Selenium.
But since I am such a swell guy, I will write your entire solution for you using XPath, Html Agility Pack and Selenium. The following solution gets the HTML of the site, then reads only the <tr> elements that have class="Odd". After that it finds all the "encrypted" <a> elements, decodes them into strings and adds them to a list. Finally, there is a small example of how to get an attribute value from one anchor.
private void HtmlParser(string url)
{
HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags=true;
GetHTML(url);
htmlDoc.Load("x.html", Encoding.ASCII, true);
HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//table[@class='DataGrid']/descendant::*/tr[@class='Odd']/td/script");
List<string> urls = new List<string>();
foreach(HtmlNode x in nodes)
{
urls.Add(ConvertStringToUrl(x.InnerText));
}
Console.WriteLine(ReadingTheAnchor(urls[0]));
}
private string ConvertStringToUrl(string octUrl)
{
octUrl = octUrl.Replace("IPDecode(\"", "");
octUrl = octUrl.Remove(octUrl.Length -2);
octUrl = octUrl.Replace("%", "");
string ascii = string.Empty;
for (int i = 0; i < octUrl.Length; i += 2)
{
String hs = string.Empty;
hs = octUrl.Substring(i,2);
uint decval = System.Convert.ToUInt32(hs, 16);
char character = System.Convert.ToChar(decval);
ascii += character;
}
//Now you have the <a> elements containing the links; each can be read as a separate HTML document containing just an <a>
Console.WriteLine(ascii);
return ascii;
}
private string ReadingTheAnchor(string anchor)
{
//returns url of anchor
HtmlDocument anchorHtml = new HtmlAgilityPack.HtmlDocument();
anchorHtml.LoadHtml(anchor);
HtmlNode h = anchorHtml.DocumentNode.SelectSingleNode("a");
return h.GetAttributeValue("href", "");
}
//using OpenQA.Selenium; using OpenQA.Selenium.Firefox;
private void GetHTML(string url)
{
using (var driver = new FirefoxDriver())
{
driver.Navigate().GoToUrl(url);
Console.Clear();
System.IO.File.WriteAllText("x.html", driver.PageSource);
}
}
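The decoding step in ConvertStringToUrl above (drop the % signs, then read two hexadecimal digits per character) can be sketched in isolation. This is a Java rendering for illustration only. The sample input is made up, and the exact format the site emits is an assumption taken from the answer's description. The class and method names are hypothetical.

```java
class IpDecodeSketch {
    // Decodes text such as "%3c%61%3e": remove every '%', then treat the
    // remaining characters as two-digit hexadecimal character codes.
    static String decodeHexPairs(String encoded) {
        String hex = encoded.replace("%", "");
        StringBuilder out = new StringBuilder();
        // i + 1 < length skips a dangling final digit instead of throwing
        for (int i = 0; i + 1 < hex.length(); i += 2) {
            out.append((char) Integer.parseInt(hex.substring(i, i + 2), 16));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeHexPairs("%3c%61%3e")); // prints <a>
    }
}
```

Unlike the C# loop, this sketch ignores a trailing odd digit rather than failing on it; whether that is desirable depends on how strictly the input should be validated.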

Adding dynamic values in an input box using watin for a web based application

I am working on a web-based application that handles online orders placed by customers.
We are using WatiN for sanity testing. This is what my code looks like:
mybrowser.TextField(Find.ByName("searchBox")).Value = "milk";
mybrowser.Image(Find.ByName("search")).Click();
In the input field I want to enter any string value (e.g. meat/bakery) of length X.
Please help.
As I read your question you want to be able to generate and input a string of a specific, given length X. Here is some code I wrote a while ago for that.
public static string GetRandomString(int length)
{
StringBuilder randomString = new StringBuilder();
Random randomNumber = new Random();
for (int i = 0; i < length; i++)
{
randomString.Append(Convert.ToChar(Convert.ToInt32(Math.Floor(26 * randomNumber.NextDouble() + 65))));
}
return randomString.ToString();
}
If you want to pick from a specific list of items, I would use code like this:
public static string GenerateRandomFood()
{
string[] foods = {"Bread", "Cheese", "Milk"};
// Random.Next(n) returns an index in [0, n), so every food can be picked
return foods[new Random().Next(foods.Length)];
}
Hope that is what you are after.
