HtmlAgility ParseErrors Property - html-agility-pack

What errors can I expect the HtmlAgility library to fix? I know from my own experience that it can complete an unclosed tag, like:
<car>Nissan</car
When I call Load or LoadHtml, it will fix it, like:
<car>Nissan</car>
I also know that the ParseErrors collection can report Reason, Stream etc.
Is there a list of these errors? Or can you tell from your own experience how reliable HtmlAgility is at fixing errors, and which errors it cannot fix?

Historically, Html Agility Pack was never designed to fix Html, but rather to be able to load, modify & save it back, even if this Html has errors.
This means it will fix the errors that browsers generally fix automatically, like the one you show in your question. The list of errors was determined experimentally, and you can browse the source for deeper insight. That being said, it was designed back in 2000/2001, so things may have changed in that area :-)
The ParseErrors collection will contain HtmlParseError objects with a code. The code is an enum that's documented:
/// A tag was not closed.
TagNotClosed,
/// A tag was not opened.
TagNotOpened,
/// There is a charset mismatch between stream and declared (META) encoding.
CharsetMismatch,
/// An end tag was not required.
EndTagNotRequired,
/// An end tag is invalid at this position.
EndTagInvalidHere
There is also an OptionFixNestedTags property on HtmlDocument (default value is false) that is capable of fixing LI, TR, TH and TD tags when nesting errors are detected. This means that if it detects a closing TR without all the needed closing TD tags, they will be closed automatically. Again, this is exactly what a browser will do with malformed Html.

Related

CKEDITOR How to find and wrap text in span

I am writing a CKEDITOR plugin that needs to wrap certain pieces of text in a tag. From a webservice, I have an array of items that need to be wrapped. The array is just the plain text strings. Such as:
["best buy", "horrible migraine", "eat cake"]
I need to find the instances of this text in the editor and wrap them in a span tag.
This is further complicated because the text may be marked up. So the HTML for "best buy" might be
"<strong>best</strong> buy"
but the text returned from the web service is stripped of any markup.
I started trying to use a CKEDITOR.htmlParser() object, and that seems like it is moderately successful. I am able to catch the parser.onText event and check if the text contains anything in my array.
But then I cannot modify that text. Modifications are not persisted back to the source html. So I think using the htmlParser() is a dead-end.
What is the best way to accomplish this task?
Oh, and as a bonus, I also do not want to lose my user's current cursor position when the changes are displayed.
Here is what I wound up doing and it seems to be working so far.
I created a text filter rule that searches through my array of items for any item that is contained (or partially contained) in the text. If it finds one, it wraps the matching text in my span.
A drawback here is that I wind up with two spans for items with markup. But in my use case, this is tolerable.
Then I set the results using:
editor.document.getBody().setHtml(results);
Because of this, I also have to strip this markup back out when this text gets read. I do this using an elements filter on editor.dataProcessor.htmlFilter.
This seems to be working well for my (so far limited) test cases.
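The wrapping step described above can be sketched as a plain string function. This is only an illustration under assumptions, not the code from the answer: the function names and the "flagged" class name are hypothetical, and the sketch only matches a phrase that sits entirely inside one run of text, so a phrase split by markup (such as "<strong>best</strong> buy") is not handled here.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every occurrence of any phrase in a span.
// `className` is a hypothetical class name chosen by the caller.
function wrapPhrases(text, phrases, className) {
  // Ignore empty phrases, which would otherwise match at every position.
  var usable = phrases.filter(function (p) { return p.length > 0; });
  if (usable.length === 0) return text;

  // Try longer phrases first so a longer phrase wins over a shorter prefix.
  var pattern = usable
    .slice()
    .sort(function (a, b) { return b.length - a.length; })
    .map(escapeRegExp)
    .join("|");
  var re = new RegExp(pattern, "gi");

  // A function replacer avoids special handling of "$" in the replacement.
  return text.replace(re, function (match) {
    return '<span class="' + className + '">' + match + "</span>";
  });
}

// Example call with the phrases from the question.
console.log(wrapPhrases("I will eat cake at best buy",
  ["best buy", "horrible migraine", "eat cake"], "flagged"));
```

A function like this would be called from whatever text rule the plugin registers. How that rule is wired into the editor is left out because it depends on the CKEditor version in use.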

How can I find what triggered a dirtyforms popup?

I have a form that normally works with respect to dirtyforms. However, there is one circumstance where a jquery-ui datepicker calendar will pop up the "are you sure" dialog when a date is clicked.
I emphasize that this normally works correctly. The situation is related to the initial conditions of the form data source. Things work when the object being referenced is existing, but not if it is new. So I am sure somewhere there is a difference in the initial conditions of the form. But in theory the form should be identical.
How can I find what is causing the popup so I can fix my issue?
Well, I did find what was causing my problem by comparing the HTML of the working and non-working situations. (Not an easy task since there were many non-relevant differences.)
It seems that the original coder did a strange thing: they left out some JavaScript function declarations when the page was "new", but of course did not eliminate the calls to those functions.
So I guess that the javascript errors were the root cause. At least when I include those function declarations everything works correctly.
By default, most anchor links on the page will trigger the dialog. We don't have a hard-coded selector of all potential 3rd party widgets, so you must manually take inventory of whether these widgets use hyperlinks and ignore them if they are causing errant behavior.
See ignoring things for more information.
I was unable to reproduce this behavior using Dirty Forms 2.0.0, jQuery UI 1.11.3, and jQuery 1.11.3. However, in previous versions of Dirty Forms, you can probably use the following code to ignore the hyperlink clicks from the DatePicker.
$('.ui-datepicker a').addClass($.DirtyForms.ignoreClass);

how to disable tag validation in ckeditor?

CKeditor apparently automatically creates matching end tags when you enter a start tag. Is there a way to turn this behavior off?
I have a situation where I am creating two blocks of text in an admin program using CKeditor, then I'm using these to paint a page with the first block, some static content, and then the second block. Now I've got a case where I want to wrap the static content in a table. I was thinking, No problem, I'll just put the <table> tag in the first block and the </table> tag in the second block, and the static content will be inside the table. But no, CKeditor insists on closing the table tag in the first block.
In general, I can go to source mode and enter HTML directly, but CKeditor then decides to reformat my tagging. This seems to rather defeat the purpose of having a source mode. (I hate it when I tell the computer what I want and it tells me, No, you're wrong, I know better than you what you want!)
CKEditor produces valid HTML, and valid HTML has to include both start and end tags. There's no way to change this behaviour without hacking the editor. Note that even if you force the editor to produce content without one of these tags, it will then try to fix that content when loading it, and the result won't be what you expect. E.g. load:
<p>foo</p></td></tr></table>
And you'll completely lose this table, so only a regexp-based fix on data loading could help. In the opposite case:
<table><tr><td><p>foo</p>
You'll end up with the paragraph wrapped in a table, which is better. But what if someone removes this table from the editor contents?
Therefore you should do this integration outside the editor: prepend the table to the contents of one editor and append to the contents of the second one. You simply cannot force the editor to work on partial HTML.
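One way to read "do this integration outside the editor" is to keep both editor blocks as complete HTML and add the table when the page is assembled. The following is a minimal sketch under that assumption; assemblePage and its parameter names are hypothetical, and the single-cell table layout is only an example.

```javascript
// Join two complete editor outputs around the static content.
// The table markup is added here, outside the editors, so neither
// editor ever has to hold an unbalanced <table> or </table> tag.
function assemblePage(firstBlock, staticContent, secondBlock) {
  return (
    firstBlock +
    "<table><tr><td>" +
    staticContent +
    "</td></tr></table>" +
    secondBlock
  );
}

// Example: in practice the two blocks would come from the saved editor data.
console.log(assemblePage("<p>intro</p>", "<p>static</p>", "<p>outro</p>"));
```

With this arrangement each editor only ever loads and saves well-formed fragments, so its tag fixing has nothing to rewrite.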

How to retrieve plain text from a formatted website to use in UIWebView

Not sure if what I want to do is possible, but what I am hoping to do is somehow gather certain pieces of text from a website, remove the header, footer, background, all formatting, and place it into my application in a scrollview or something similar...
I'll give you an example... Imagine I was making wikipedia's iPhone app, I want to download the information about the wiki on dogs, without the header, side bars etc, just the text. How would I go about doing this?
I understand that for this I have not provided any example code or what I've tried or started, but that's just because in this case I'm lost! That doesn't mean I want full chunks of code either. Any help will do. If this doesn't work, I will just have to make a 'mobile optimised' version of the webpages I want to include in my app.
Thanks
(Edit: the term I was trying to use was 'strip the web page of its HTML coding')
You may be going about this the wrong way, or perhaps even asking the wrong question.
Does the target website have an API or datafeed of some kind?
Can you get the information you need in JSON or XML format directly from the site?
I think you've misunderstood the technology. HTML is merely the framework on which the formatting and data are hung.
Parsing the HTML page seems like an awfully big headache. I doubt you'll ever be able to get it to work reliably, because almost all sites these days are partially or wholly generated on the server side, and the page is only the result.
Some sites hide the information in memory and others fetch it dynamically, through ajax for example, which means that simply trying to get the data by parsing the HTML may get you no data at all.
Another issue you should be aware of is that simply copying the data from generated websites may open you up to copyright issues.
You have to parse the html code, search for the part that you want and "throw" away the part that you do not need. This is more or less brute force, and the code of the website must not change, otherwise you are screwed. So with this method you have to write the parser by hand. But maybe there is an atom or rss feed that you can parse instead. This will be much easier, and you will not depend on the website layout because the rss/atom feed is just about the data. For parsing rss you could try out NSXMLParser.
And then you have to make a valid html page out of the data and present it in the UIWebView.

IE8 not accepting multiple classes in quirks mode?

I'm running into a situation where IE8 appears to be dropping CSS selectors. I find this hard to believe, but I can't figure out what is happening.
In a .css file I have this declaration:
#srp tr.objectPath.hover td {
    border-top:none;
}
However, when I inspect the file in IE8 through the built-in developer tools, the declaration is modified to this:
#srp TR.hover TD {
    border-top:medium none;
}
I don't care about the change in case or the restatement of the rule, but dropping the '.objectPath' is a real problem because it targets the rule more broadly than I intend.
I note that this page is, and must stay in, Quirks mode.
Any ideas what is happening?
Thanks!
In Quirks Mode, IE 8 renders the page and treats the DOM as IE 5.5 would. That is why IE 8 in Quirks Mode ignores multiple classes. It is not a bug in IE 8; if you want your page to be parsed and rendered properly, you must set a proper DOCTYPE so that the page renders in Standards Mode.
tr.objectPath.hover is not correct syntax if you are trying to use the hover pseudo-class. The correct syntax would be with a colon (i.e. tr.objectPath:hover). When the machine is reading your code, it reads objectPath as the tr's class name, but then when it gets to hover it gets rid of the old class name and replaces it with the hover class (whether or not any elements actually belong to that class). Also, if this is the case, then I don't see what you are trying to do by referring to the child of an instance of :hover.
If you are in fact using hover as a class name (which I wouldn't recommend, as it could be confusing to people reading your code) and you want the CSS to apply to the td children of a tr that is of both the objectPath and hover classes, you might consider just creating a new class for elements that are of both classes and using that instead (i.e. #srp tr.newClass td).
EDIT: Looking further into the matter, it appears that this is (yet) a(nother) known bug in IE. I have tested it out in IETester and it seems to exist in all versions of IE. The only solution I could see on your end is very very messy:
First, it would require using JavaScript in your CSS since you don't have access to anything else. This is possible but very prone to bugs.
Second, it would require creating a getElementsByClass function in that JavaScript that could take multiple class names as parameters. This would be a very sizable chunk of code.
Finally, you would probably want to look into specifying this code to be used only by IE so that users of other browsers don't have to deal with any potential problems from all this stuff.
To clarify, I would NOT recommend doing this. Instead, I would suggest contacting someone who does have access to the HTML source code (assuming you are actually working in partnership with them) so that they could apply the much simpler fix of adding an objectPathhover class to the tr elements that belong to both classes or even to their td children.
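For illustration only, the multi-class lookup mentioned above boils down to a check like the one below. This is a sketch, not a recommendation and not code from the answer: the names hasAllClasses and filterByClasses are hypothetical, and the functions work on any list of objects that carry a className string, so they can be tried without a browser.

```javascript
// True when the class attribute string contains every name in `required`.
// Written with plain loops and `var` so it does not rely on newer syntax.
function hasAllClasses(classAttr, required) {
  // Normalise whitespace and pad with spaces so whole names can be matched.
  var padded = " " + (classAttr || "").replace(/\s+/g, " ") + " ";
  for (var i = 0; i < required.length; i++) {
    if (padded.indexOf(" " + required[i] + " ") === -1) {
      return false;
    }
  }
  return true;
}

// Keep only the elements that carry all of the required class names.
// `elements` can be an array or any array-like list of elements.
function filterByClasses(elements, required) {
  var matches = [];
  for (var i = 0; i < elements.length; i++) {
    if (hasAllClasses(elements[i].className, required)) {
      matches.push(elements[i]);
    }
  }
  return matches;
}
```

In a page, the list of candidate rows would come from a DOM lookup, and the matching rows would then receive the combined class described in the paragraph above.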
It looks like you've got some incorrect syntax in your declaration, but it's hard to tell exactly what you're doing. Are you trying to match a hover state, or is there a class actually called 'hover'?
If going for the state, try:
#srp tr.objectPath:hover td {
...
}
If there is another class, you may need 2 separate declarations:
#srp tr.objectPath td {
...
}
#srp tr.hover td {
...
}
