Work-around a StackOverflowException - html-agility-pack

I'm using HtmlAgilityPack to parse roughly 200,000 HTML documents.
I cannot predict the contents of these documents, and one of them causes my application to fail with a StackOverflowException. The document contains this HTML:
<ol>
<li><li><li><li><li><li>...
</ol>
There are roughly 10,000 <li> elements nested like that. Due to the way HtmlAgilityPack parses HTML it causes a StackOverflowException.
Unfortunately a StackOverflowException is not catchable in .NET 2.0 and later.
I considered giving the thread a larger stack, but that is a hack: it would make my program use a lot more memory (it starts about 50 threads for processing HTML, and every one of them would get the larger stack), and the size would need manual adjustment if a similar document ever turned up again.
Are there any other workarounds I could employ?

I just patched an error that I believe is the same as the one you're describing, and uploaded the patch to the HAP project site:
http://www.codeplex.com/site/users/view/sjdirect (see the patch on 3/8/2012)
Or see more documentation of the issue and result here....
https://code.google.com/p/abot/issues/detail?id=77
The actual fix was...
Added HtmlDocument.OptionMaxNestedChildNodes that can be set to prevent StackOverflowExceptions that are caused by tons of nested tags. It will throw an ApplicationException with message "Document has more than X nested tags. This is likely due to the page not closing tags properly."
How I'm Using Hap After Patch...
HtmlDocument hapDoc = new HtmlDocument();
hapDoc.OptionMaxNestedChildNodes = 5000; // This is what the patch added
string rawContent = GETTHECONTENTHERE; // placeholder: obtain the raw HTML here
try
{
    hapDoc.LoadHtml(rawContent);
}
catch (Exception e)
{
    // Instead of a StackOverflowException you should end up here now
    hapDoc.LoadHtml("");
    _logger.Error(e);
}

Ideally, the long-term solution is to patch HtmlAgilityPack to use a heap-stack instead of the call-stack, but that would be an undertaking too big for me. I've temporarily lost my CodePlex account details, but when I get them back I'll submit an Issue report on the problem. I also note that this issue could present a Denial-of-Service attack vulnerability to any site that uses HtmlAgilityPack to sanitize user-submitted HTML - a crafted overly-nested HTML document would cause the w3wp.exe process to die.
In the meantime, I figured the best way forward is to manually override the maximum thread stack size. I was wrong in my earlier statement that a bigger stack-size means that all threads automatically consume that memory (it seems memory pages are allocated for a thread stack as it grows, not all-at-once).
I made a copy of the <ol><li> page and ran some experiments. My program failed when the maximum stack size was below 2^21 bytes (2 MB), but succeeded with a maximum of 2^22 bytes (4 MB), and 4 MB in my book passes as an "acceptable" hack... for now.
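The "heap-stack instead of the call-stack" idea mentioned above can be illustrated independently of HtmlAgilityPack. The following is only a sketch in JavaScript, not HtmlAgilityPack code: the node shape ({ name, children }) and the function names maxDepth and buildNestedChain are made up for illustration. It walks a deeply nested tree with an explicit array of pending nodes, so the nesting depth is limited by available memory rather than by the thread's stack size.

```javascript
// Hypothetical node shape for illustration: { name: string, children: [] }.
// Computes the maximum nesting depth without recursion by keeping the
// pending nodes in a heap-allocated array instead of on the call stack.
function maxDepth(root) {
  var deepest = 0;
  var pending = [{ node: root, depth: 1 }];
  while (pending.length > 0) {
    var entry = pending.pop();
    if (entry.depth > deepest) {
      deepest = entry.depth;
    }
    for (var i = 0; i < entry.node.children.length; i++) {
      pending.push({ node: entry.node.children[i], depth: entry.depth + 1 });
    }
  }
  return deepest;
}

// Builds an <ol> with `levels` nested <li>-like nodes, also without recursion.
function buildNestedChain(levels) {
  var root = { name: "ol", children: [] };
  var current = root;
  for (var i = 0; i < levels; i++) {
    var child = { name: "li", children: [] };
    current.children.push(child);
    current = child;
  }
  return root;
}
```

A recursive version of maxDepth would need one call frame per nesting level, which is the pattern that overflows on input like the 10,000 nested <li> elements described in the question.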

This should work:
HtmlDocument.MaxDepthLevel = 10000;
var doc = new HtmlDocument();
try
{
    doc.LoadHtml(document);
}
catch (Exception ex)
{
    Console.WriteLine("Exception while loading html: " + ex);
    yield break; // assumes this code sits inside an iterator method
}

Related

How to clear UI memory (heap) in WPF page after closing it

I have a List<DocumentModel>, where DocumentModel has a byte[] XpsData property.
I convert each byte[] XpsData into an XpsDocument to bind to a <DocumentViewer>:
public static XpsDocument ByteToXpsDocument(byte[] sourceXPS)
{
    temp++;
    MemoryStream ms = new MemoryStream(sourceXPS);
    string memoryName = "memorystream://ms" + temp + ".xps";
    Uri memoryUri = new Uri(memoryName);
    try
    {
        PackageStore.RemovePackage(memoryUri);
    }
    catch (Exception) { }
    Package package = Package.Open(ms);
    PackageStore.AddPackage(memoryUri, package);
    XpsDocument xps = new XpsDocument(package,
        CompressionOption.SuperFast, memoryName);
    return xps;
}
The List<DocumentModel> holds 300+ objects to bind, and each XpsDocument is 400+ KB in size.
The bindings work. But once the page has loaded, the app size increases by 300 MB, because all 300+ XpsDocument objects are loaded into the page UI.
After I close the page, the app's memory size stays where it is. (I expected that closing the page would release all of its memory and the app would return to its initial size, but that does not happen.)
When I go back and open the same page again with another 300+ items, the app size increases again (500+ MB) and the app gets slower and slower. Memory is still holding this page's previous data, which is no longer needed.
Now, please reread the point above about closing the page to understand my issue and help me with this.
My Google searches:
- How to release UI memory in WPF.
- How to clear Heap memory in WPF #.
Expectation:
By the 4th visit to the same page.xaml, the app size has increased to 2400 MB, and the app becomes slow and freezes until loading has completed.
The first load, by contrast, does not take long (4 sec).
What I expect is that the Nth load of the same page behaves like the first.

Locale string comparison does not work properly in Firefox extension web worker

The localeCompare() function does not behave the same in a Firefox extension main code and in a web worker (or chrome worker).
For instance, in the main code, I have this code:
var array = ["École", "Frère", "frère", "école"];
array.sort(function(a, b) {
    return a.localeCompare(b);
});
console.log('Main: ' + array);
it shows:
Main: �cole,�cole,Fr�re,fr�re
Which is the right sorting (the encoding is not my problem).
In the worker, I have this code:
var array = ["École", "Frère", "frère", "école"];
array.sort(function(a, b) {
    return a.localeCompare(b);
});
self.postMessage(array);
it prints:
Frère,frère,école,�0cole
which is in the wrong order (once again, the encoding is not my problem).
The sorting in the main code is ok, but not the one in the web worker.
I tried to change the options of the localeCompare() function in the web worker, but it does not change anything.
Why is the sorting different in the web worker and how to get it right in the web worker?
(I also tried sending the data to the main code, sorting it there, and sending it back to the web worker, but for some reason I still got the wrong order: école,�0cole,Frère,frère.)
Thanks for your help.
localeCompare is still broken in Firefox Web Workers.
Wladimir mentioned Bug 616841, which indeed fixed it almost everywhere... except for web workers, which were left broken because the Intl backend was (is?) not thread-safe, or some other thread-safety issues. The corresponding "Dead end" patch was never reviewed nor checked in.
I have now filed Bug 903780, with a test case based on your code, so that localeCompare will hopefully be fixed in the future.
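Until the bug is fixed, one possible workaround is to avoid localeCompare in the worker and sort with a hand-written comparator. The sketch below is an assumption, not something from the bug reports: foldKey and compareFolded are made-up names, it relies on String.prototype.normalize being available in the worker, and it only approximates locale-aware ordering by ignoring accents and case on the first pass.

```javascript
// Hypothetical helper (not a Firefox API): builds a sort key by decomposing
// accented letters (NFD), dropping the combining marks, and lowercasing.
function foldKey(s) {
  return s.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase();
}

// Compare by folded key first. When two strings fold to the same key, fall
// back to a plain code-unit comparison so the order stays deterministic.
function compareFolded(a, b) {
  var ka = foldKey(a);
  var kb = foldKey(b);
  if (ka < kb) return -1;
  if (ka > kb) return 1;
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
}

// Usage with the array from the question.
var words = ["École", "Frère", "frère", "école"];
words.sort(compareFolded);
```

With this comparator the two "école" variants sort before the two "frère" variants, regardless of case or accent. It is not a substitute for real collation rules, which differ between locales.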

How would I approach updating all people currently on a page's pages when someone triggers an event?

For example, when someone enters a comment on a page, how would I cause all other browsers on that page to, say, reload it to receive the new update? Specifically, I am looking for 'dynamic' updating, NOT 'periodic' updating: minimal bandwidth for maximal efficiency.
I am no expert programmer, so please simplify the solution if possible.
There is HTTP Server Push technology, but from what I have heard, I would always prefer polling (it consumes fewer server resources and is more compatible).
You could use websockets, which are part of HTML5.
var socket = new WebSocket("ws://example.com:8000/websocket.php");
socket.onmessage = function(msg) {
    // update content here; the payload is in msg.data
    alert("New content: " + msg.data);
};
However, you would have to have a server-side technology that supports it.
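Whatever server technology is used, the server has to remember which connections are viewing a page and push new content to all of them. The following is a minimal sketch of that bookkeeping only, assuming each connection object exposes a send(message) method; PageHub and its method names are made up for illustration and do not belong to any particular WebSocket library.

```javascript
// Hypothetical in-memory hub: tracks the connections viewing each page and
// pushes a message to all of them when one client posts something new.
function PageHub() {
  this.connectionsByPage = new Map(); // page id -> Set of connections
}

// Register a connection as a viewer of the given page.
PageHub.prototype.join = function (pageId, connection) {
  if (!this.connectionsByPage.has(pageId)) {
    this.connectionsByPage.set(pageId, new Set());
  }
  this.connectionsByPage.get(pageId).add(connection);
};

// Remove a connection, e.g. when its socket closes.
PageHub.prototype.leave = function (pageId, connection) {
  var connections = this.connectionsByPage.get(pageId);
  if (connections) {
    connections.delete(connection);
  }
};

// Send the update to everyone on the page except the sender.
// Returns the number of connections the message was sent to.
PageHub.prototype.broadcast = function (pageId, sender, payload) {
  var connections = this.connectionsByPage.get(pageId);
  if (!connections) {
    return 0;
  }
  var message = JSON.stringify(payload);
  var delivered = 0;
  connections.forEach(function (connection) {
    if (connection !== sender) {
      connection.send(message);
      delivered++;
    }
  });
  return delivered;
};
```

On the browser side, each client would then receive the JSON string in the onmessage handler and update the page, with no periodic polling involved.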

Usage of relative=up in selenium

Can anyone explain the usage of
selenium.selectFrame("relative=up");
sample code:
selenium.selectFrame("frame");
String Error_MSG_1 = selenium.getText("//div");
selenium.selectFrame("relative=up"); // if I remove this statement, it throws an exception
if (selenium.isTextPresent("error message")) {
    assertEquals("","");
}
//Close error pop-up
selenium.click(Close_popup);
If your web application uses iframes, you will often find while testing that, say, a text string is clearly displayed in the browser, yet the Selenium script fails on playback. This is because the script may not have the right iframe in context. selenium.selectFrame(...) sets the frame in which the assertion or verification is performed.
Specifically, selenium.selectFrame("relative=up") moves up one level, to the parent of the currently selected frame. Similarly, selenium.selectFrame("relative=top") selects the top-level frame.

Selenium Firefox Open timeout

Using Windows 2008, C#, Firefox 3.5.1, Selenium RC (v1.0.1)
When it works, this code executes very quickly and the page loads within .5 seconds.
However, the session always seems to fail after 3 - 5 iterations. The open command will cause a window to be spawned, but no page to be loaded. Eventually a timeout exception is returned. The page has not actually timed out. Instead, it is as though the request for a URL has never reached the browser window.
class Program
{
    static void Main(string[] args)
    {
        for (int i = 0; i < 10; i++)
        {
            var s = new DefaultSelenium("localhost", 4444, "firefox", "http://my.server");
            s.Start();
            s.SetSpeed("300");
            s.Open("/");
            s.WaitForPageToLoad("30000");
            // XPath attribute tests use @id, not #id
            s.Type("//input[contains(@id, '_username')]", "my.test");
            s.Type("//input[contains(@id, '_password')]", "password");
            s.Stop();
        }
    }
}
I have a similar set up (Firefox 3.6.15, Selenium RC 1.0.1, but on WinXP and using the Python libraries) and I am working with a couple of sites - one site is naturally prone to timeouts in normal use (e.g. by a human user) whereas the others typically are not. Those that aren't appear a little slower but the one that is prone to timeouts is significantly slower when run via RC than by a person - it won't always timeout but the incidence is much much more common.
My limited mental model for this is that somehow the extra steps RC is doing (communicating with the browser, checking what it sees in the returned pages etc etc) are somehow adding a bit to each step of the page loads and then at some point they will push it over the edge. Obviously this is overly simplified, I just haven't had time to properly investigate.
Also, I do tend to notice that the problem gets worse over time, which fits a little with what the OP has seen (i.e. working the first time but not after 3 - 5 attempts). Often a reboot seems to fix the issues, but without proper investigation I can't tell why this helps, perhaps it is somehow freeing up memory (the machine is used for other things), getting allocated to a different one of our company's proxies or something else I haven't considered.
So... not much of a full answer here (a comment would have been more appropriate, but my login isn't able to yet), but at least it reinforces that you're not the only one. Periodic restarts are an annoying thing to need to do, but in the absence of any smarter analysis and answers, maybe they'd be worth a shot?
I was facing the same problem. This is because the open method of DefaultSelenium has a timeout of 30000 ms, so it waits up to 30 s for your page to load. You can try this simple solution:
// selenium is a DefaultSelenium instance held as a private member of the class
boolean serverStartTry = false;
int tryCount = 1;
while ((!serverStartTry) && tryCount <= Constants.maxServerTries) {
    try {
        this.selenium.open(ReadConFile.readcoFile("pageName"));
        System.out.println("Server started in try no: " + tryCount);
        serverStartTry = true;
    } catch (SeleniumException e) {
        System.out.println("Server start try no: " + tryCount);
        System.out.println("Server Start Try: " + serverStartTry);
        serverStartTry = false;
        tryCount++;
    }
}
if (!serverStartTry) {
    // tryCount was incremented once past the last failed attempt
    System.out.println("Server Not started, no. of attempts made: " + (tryCount - 1));
    System.exit(0);
}
I solved it by calling:
selenium.setTimeout("60000");
before the open instruction.
