Closing anchor tags with HtmlAgilityPack

I am using the HtmlAgilityPack to scrape crummy html and get links, raw text, etc. I'm running into a few pages that have inconsistently closed <a> tags, like this:
<html>
<head></head>
<body>
<a href=...>Here's a great link! <a href=...>Here's another one!</a>
Here's some unrelated text.
</body></html>
HAP parses this, and helpfully closes the open <a> tag, but only at the very end of the document:
<html>
<head></head>
<body>
Here's a great link! <a href="...">Here's another one!
Here's some unrelated text.
</a></body></html>
In practice this means that the InnerText of any unclosed link contains all text from the rest of the page, which gets exciting when parsing a page that may contain thousands of unclosed tags and megabytes of text.
So, how can I make HAP close those tags immediately, ideally putting the close just before the next open so that there is never any overlap for an <a>? I've played around with OptionFixNestedTags and OptionAutoCloseOnEnd with no luck, and I've found advice on how to allow overlap, but I'm drawing a blank on actually fixing it.
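For illustration, this is roughly the setup, plus one post-parse cleanup idea; the loop below is my own sketch of "close each <a> before the next one opens", not a documented HAP option, and it assumes the raw markup is already in an html string:

using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.OptionFixNestedTags = true;   // the options mentioned above
doc.OptionAutoCloseOnEnd = true;
doc.LoadHtml(html);               // 'html' holds the raw page markup

// Sketch: for every <a> that ends up containing another <a>, move the nested
// anchor and everything after it out, so the outer link effectively closes
// just before the next one opens.
var anchors = doc.DocumentNode.SelectNodes("//a");
if (anchors != null)
{
    foreach (var outer in anchors)
    {
        var nested = outer.ChildNodes.FirstOrDefault(n => n.Name == "a");
        if (nested == null) continue;

        var moving = outer.ChildNodes.SkipWhile(n => n != nested).ToList();
        var insertAfter = outer;
        foreach (var node in moving)
        {
            node.Remove();
            outer.ParentNode.InsertAfter(node, insertAfter);
            insertAfter = node;
        }
    }
}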

Related

Scrapy xpath returning more elements than it should

I'm currently going through a tutorial on Scrapy and am encountering the following issue when using XPath to filter out certain tag elements from an HTML file, for example:
<html>
<head>
<title>Title of the page</title>
</head>
<body>
<h1>H1 Tag</h1>
<h2>H2 Tag with link</h2>
<p>First Paragraph</p>
<p>Second Paragraph</p>
</body>
</html>
The line response.xpath('/html/head/title').extract() returned the following list:
['<title>Title of the page</title>\n </head>\n <body>\n <h1>H1 Tag</h1>\n <h2>H2 Tag with link</h2>\n <p>First Paragraph</p>\n <p>Second Paragraph</p>\n </body>\n</html>\n'].
It seems like it starts from the correct tag but doesn't stop at the closing tag. I'm using Visual Studio Code v1.65.1. Any help would be greatly appreciated.
Since you have not provided a link to, or the specific HTML of, the page you are actually trying to parse, it is not possible to reproduce the problem. There is no problem with the XPath or the HTML that you posted; see my results below:
In [1]: response.xpath('/html/head/title').extract()
Out[1]: ['<title>Title of the page</title>']
That being said, you have another problem that I am addressing here. extract always returns a list, even if there is only one match. To get the first match as a string, the method is extract_first.
That's why Scrapy now recommends using get to get the first match as a string, and getall to get the list of strings; see the Scrapy selector documentation.
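For example, in the Scrapy shell or a spider callback, using the selector from the question:

# .get() returns the first match as a string (or None); .getall() returns a list of strings.
response.xpath('/html/head/title').get()      # '<title>Title of the page</title>'
response.xpath('/html/head/title').getall()   # ['<title>Title of the page</title>']

# The older spellings still work:
# .extract_first()  ==  .get()
# .extract()        ==  .getall()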

CKEditor moving br tags

I'm having a problem with CKEditor changing my original paragraph formatting with negative side effects.
I start with a basic paragraph loaded into CKEditor using setData():
<p><span style="font-size:50px">My Text</span></p>
... more document content ...
In the editor, I move the cursor to the end of the phrase "My Text" and press enter (with config.enterMode=CKEDITOR.ENTER_BR setting enabled). Inspecting the markup inside the editor I now see:
<p><span style="font-size:50px">My Text<br><br></span></p>
... more document content ...
Then, when I call getData() to pull the contents from the editor and save the document to a database, the HTML extracted by getData() looks like this:
<p><span style="font-size:50px">My Text</span><br> </p>
... more document content ...
This is a problem because, while editing, the <br> tag was inside the <span> and was subject to the 50px font-size style, so the user saw a 50px blank line before the next piece of document content. After saving the HTML to a database and reloading it later, the <br> tag is outside the <span> and is no longer subject to the 50px font sizing, so the blank line appears much smaller than before.
The round trip fidelity of the text formatting is not preserved and the user is frustrated by the results.
Can someone help me understand the results I'm seeing with <br> tags being reformatted and moved around during the editing life cycle, and how I might fix this problem?
Using CKEditor v4.4.1
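For reference, the setup described above boils down to roughly this; the 'editor1' element id is just a placeholder, not the actual markup:

// CKEditor 4: replace a <textarea id="editor1"> and use <br> for Enter.
var editor = CKEDITOR.replace('editor1', {
    enterMode: CKEDITOR.ENTER_BR
});

editor.setData('<p><span style="font-size:50px">My Text</span></p>');
// ... user presses Enter at the end of "My Text" ...
var saved = editor.getData();  // the <br> comes back outside the <span>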

Make Share icons from AddThis align horizontally instead of vertically on self-hosted WordPress

I have a problem that has taken me days to figure out.
The social follow icons I get from the AddThis website appear vertically instead of horizontally. I want to make them appear horizontally, but I have not found a way to do so.
Below is the code I got from https://www.addthis.com/get/follow
<!-- AddThis Follow BEGIN -->
<p>Follow Us</p>
<div class="addthis_toolbox addthis_default_style">
<a class="addthis_button_facebook_follow" addthis:userid="TheMostafaAbedi"></a>
<a class="addthis_button_twitter_follow" addthis:userid="theMostafaAbedi"></a>
<a class="addthis_button_google_follow" addthis:userid="106914586115617584077"></a>
</div>
<script type="text/javascript" src="http://s7.addthis.com/js/300/addthis_widget.js#pubid=xa-506a607f490b6601"></script>
<!-- AddThis Follow END -->
The specific page that the problem occurs is http://www.under-review.com/about under Mostafa Abedi description.
You've got line breaks between the lines of code, which makes the icons stack one below the other. You're most likely using the WordPress editor to insert the code, which alters the formatting.
If you're using the Visual editor, switch to the HTML editor and give it a try.
If you're using the HTML editor, at the very least remove all the whitespace between the items so they sit on one line, which prevents WordPress from inserting new lines, i.e.:
<a class="addthis_button_facebook_follow" addthis:userid="TheMostafaAbedi"></a><a class="addthis_button_twitter_follow" addthis:userid="theMostafaAbedi"></a><a class="addthis_button_google_follow" addthis:userid="106914586115617584077"></a>
In my experience, about 90% of formatting issues are fixed by good code, and as a rule I follow the W3C standards. I suggest you validate against them, because at the moment the page fails validation (both the HTML and the CSS):
http://validator.w3.org/check?verbose=1&uri=http%3A%2F%2Funder-review.com%2Fabout%2F
http://jigsaw.w3.org/css-validator/validator?profile=css21&warning=0&uri=http%3A%2F%2Funder-review.com%2Fabout%2F

Convert HTML to plain text and maintain structure/formatting, with ruby

I'd like to convert html to plain text. I don't want to just strip the tags though, I'd like to intelligently retain as much formatting as possible. Inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc.
The input is pretty simple, usually well-formatted html (not entire documents, just a bunch of content, usually with no anchors or images).
I could put together a couple of regexes that get me 80% of the way there, but I figured there might be existing solutions with more intelligence.
First, don't try to use regex for this. The odds are really good you'll come up with a fragile/brittle solution that will break with changes in the HTML or will be very hard to manage and maintain.
You can get part of the way there very quickly using Nokogiri to parse the HTML and extract the text:
require 'nokogiri'
html = '
<html>
<body>
<p>This is
some text.</p>
<p>This is some more text.</p>
<pre>
This is
preformatted
text.
</pre>
</body>
</html>
'
doc = Nokogiri::HTML(html)
puts doc.text
>> This is
>> some text.
>> This is some more text.
>>
>> This is
>> preformatted
>> text.
The reason this works is that Nokogiri returns the text nodes, which are basically the whitespace surrounding the tags along with the text contained in them. If you do a pre-flight cleanup of the HTML using tidy, you can sometimes get much nicer output.
The problem is when you compare the output of a parser, or any means of looking at the HTML, with what a browser displays. The browser is concerned with presenting the HTML in as pleasing way as possible, ignoring the fact that the HTML can be horribly malformed and broken. The parser is not designed to do that.
You can massage the HTML before extracting the content: remove extraneous line breaks such as "\n" and "\r", then replace <br> tags with line breaks. There are many questions here on SO explaining how to replace tags with something else, and I think the Nokogiri site has that as one of its tutorials as well.
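A rough sketch of that massaging, assuming the raw markup is in an html string (one way among many):

require 'nokogiri'

# Strip hard line breaks from the source, then turn <br> tags into newlines.
html = html.gsub(/[\r\n]+/, ' ')
doc = Nokogiri::HTML(html)
doc.css('br').each { |br| br.replace(Nokogiri::XML::Text.new("\n", doc)) }
puts doc.text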
If you really want to do it right, you'll need to figure out what you want to do for <li> tags inside <ul> and <ol> tags, along with tables.
An alternate attack method would be to capture the output of one of the text browsers like lynx. Several years ago I needed to do text processing for keywords on websites that didn't use Meta-Keyword tags, and found one of the text-browsers that let me grab the rendered output that way. I don't have the source available so I can't check to see which one it was.

IE8 & FF XHTML error or badly formed span?

I have recently found a strange occurrence in IE8 & FF.
The designers were using JS to dynamically create some span tags for layout (they were placing rounded-corner graphics on some tabs). The XHTML, built in JS, looked like this: <span class="leftcorner" /><span class="rightcorner" /> and worked perfectly!
As we all know, dynamically rendering elements in JS can be quite processor-intensive, so I moved the elements from JS into the page source, exactly as above.
... and it didn't work... not only did it not work, it crashed IE8. The fix was simple: put the closing span in, i.e. <span class="leftcorner"></span>
I am a bit confused by this.
Firstly, as far as I am aware, <span class="leftcorner" /> is perfectly valid XHTML!
Secondly, it works when created dynamically, but not in static XHTML?!
Can anyone shed any light on this or is it simply another odd occurrence of browsers?
The major browsers only support a small subset of self-closing tags. (See this answer for a complete list.)
Depending on how you were creating the elements in JS, the JavaScript engine probably created a valid element to place in the DOM.
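For instance, if the spans were built through the DOM API rather than innerHTML, there is no self-closing syntax for the parser to misread; this is an illustrative guess at the original JS, with the container selector assumed:

var tab = document.querySelector('.tab');   // assumed container element
var span = document.createElement('span');  // always a complete element
span.className = 'leftcorner';
tab.appendChild(span);

// By contrast, the static markup goes through the HTML parser, which treats
// <span class="leftcorner" /> as an *open* tag that swallows what follows.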
I had a similar problem with <a> tags in IE.
The problem was that my links looked like this (the icon was set with CSS, so I didn't need any text inside):
<a href="link" class="icon edit" />
Unfortunately, in IE these links were not displayed at all. They have to be in <a href="link" class="icon edit"></a> format (leaving the text empty didn't work either, so I put &nbsp; there). So what I did was add a few extra lines of JS to fix it, as I didn't want to change all my HTML just for this one browser (PS: I'm using jQuery for my JS).
// Note: $.browser only exists in older jQuery (it was removed in 1.9).
if ($.browser.msie) {
    $('a.icon').html('&nbsp;');
}
IE in particular does not support XHTML. That is, it will never apply proper XML parsing rules to a document - it will treat it as HTML even with proper DOCTYPE and all. XHTML is not always valid SGML, however. In some cases (such as <br/>) IE can figure it out because it's prepared to parse tagsoup, and not just valid SGML. However, in other cases, the same "tagsoup" behavior means that it won't treat /> as self-closing tag terminator.
In general, my advice is to just use HTML 4.01 Strict. That way you know exactly what to expect. And there's little point in feeding XHTML to browsers when they're treating it as HTML anyway...
I think that one of the answers to "Is writing self closing tags for elements not traditionally empty bad practice?" will answer your question.
XHTML is only XHTML if it is served as application/xhtml+xml — otherwise, at least as far as browsers are concerned, it is HTML and treated as tag soup.
As a result, <span /> means "A span start tag" and not "A complete span element". (Technically it should mean "A span start tag and a greater than sign", but that is another story).
The XHTML spec tells you what you need to do to get your XHTML to parse as HTML.
One of the rules is "For non-empty elements, end tags are required". The list of elements includes a quick reference to which are empty and which are not.
