scrapy: Remove elements from an xpath selector

I'm using scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and subelements of a certain div, EXCEPT a few at the beginning, and a few at the end.
Here's the gist.
<div id="easy-id">
<stuff I don't want>
text I don't want
<div id="another-easy-id" more stuff I don't want>
text I want
<stuff I want>
...
<more stuff I want>
text I want
...
<div id="one-more-easy-id" more stuff I *don't* want>
<more stuff I *don't* want>
NB: The indenting implies closing tags, so everything here is a child of the first div -- the one with id="easy-id"
Because text and nodes are mixed, I haven't been able to figure out a simple xpath selector to grab the stuff I want. At this point, I'm wondering if it's possible to retrieve the result from xpath as an lxml.etree.elementTree, and then hack at it using the .remove() method.
Any suggestions?

I am guessing you want everything from the div with ID another-easy-id up to but not including the one-more-easy-id div.
Stack Overflow has not preserved the indenting, so I do not know where the end of the first div element is, but I'm going to guess it ends before the text.
In that case you might want
//div[@id = 'another-easy-id']/following::node()
[not(preceding::div[@id = 'one-more-easy-id']) and not(@id = 'one-more-easy-id')]
If this is XHTML you'll need to bind some prefix, h, say, to the XHTML namespace and use h:div in both places.
EDIT: Here's the syntax I went with in the end. (See comments for the reasons.)
//div[@id='easy-id']/div[@id='one-more-easy-id']/preceding-sibling::node()[preceding-sibling::div[@id='another-easy-id']]
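The question is about Scrapy, but the XPath itself is library-agnostic. As a rough way to sanity-check the expression outside a spider, here is a sketch using Nokogiri (the library the other answers on this page use); the markup is my own simplified stand-in for the structure described above:

require 'nokogiri'

# Simplified stand-in for the structure described in the question
html = <<~HTML
  <div id="easy-id">
    <span>stuff I don't want</span>
    text I don't want
    <div id="another-easy-id">more stuff I don't want</div>
    text I want
    <b>stuff I want</b>
    more text I want
    <div id="one-more-easy-id">more stuff I don't want</div>
  </div>
HTML

doc = Nokogiri::HTML(html)

# Siblings after the another-easy-id marker and before one-more-easy-id
wanted = doc.xpath("//div[@id='easy-id']/div[@id='one-more-easy-id']" \
                   "/preceding-sibling::node()[preceding-sibling::div[@id='another-easy-id']]")
puts wanted.to_html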


Trouble accessing text with an XPath query

I have this html snippet
<div id="overview">
<strong>some text</strong>
<br/>
some other text
<strong>more text</strong>
TEXT I NEED IS HERE
<div id="sub">...</div>
</div>
How can I get the text I am looking for (shown in caps)?
I tried this and got an error message saying it was not able to locate the element:
"//div[#id='overview']/strong[position()=2]/following-sibling"
I tried this and got the div with id=sub, but not the text (correctly so):
"//div[#id='overview']/*[preceding-sibling::strong[position()=2]]"
Is there any way to get the text, other than doing some string matching or regex on the contents of the overview div?
Thanks.
following-sibling is the axis; you still need to specify the actual node test (in your example the XPath processor is searching for an element named following-sibling). You separate the axis from the node test with ::.
Try this:
//div[@id='overview']/strong[position()=2]/following-sibling::text()[1]
This specifies the first text node after the second strong in the div.
If you always want the text immediately preceding the <div id="sub"> then you could try
//div[@id='sub']/preceding-sibling::text()[1]
That would give you everything between the </strong> and the opening <div ..., i.e. the upper case text plus its leading and trailing new lines and whitespace.
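The question does not say which XPath library is in use, so purely as an illustration, here is how the two expressions behave with Nokogiri (assumed here only because the rest of this page uses it); both print the uppercase text:

require 'nokogiri'

html = <<~HTML
  <div id="overview">
    <strong>some text</strong>
    <br/>
    some other text
    <strong>more text</strong>
    TEXT I NEED IS HERE
    <div id="sub">...</div>
  </div>
HTML

doc = Nokogiri::HTML(html)

# First text node after the second <strong> inside the overview div
puts doc.at_xpath("//div[@id='overview']/strong[position()=2]/following-sibling::text()[1]").text.strip

# Text node immediately before <div id="sub">
puts doc.at_xpath("//div[@id='sub']/preceding-sibling::text()[1]").text.strip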

Regex encapsulate full line and surround it

I can find examples of surrounding a line but not surrounding and replacing, and I'm a bit new to Regex.
I'm trying to ease up my markdown, so that I do not need to add in html just to get it to center images.
With pandoc, I apparently need to surround an image with DIV tags to get it to be centered, right justified, or whatever.
Instead of typing that every time, I'd like to just preprocess my markdown with a ruby script and have ruby add in the DIV's for me.
So I can type:
center![](image.jpg)
and then run a ruby script that will change it to
<div class="center">
![](image.jpg)
</div>
I want the regex to find "center!" and get rid of the word "center" and surround the rest with DIV tags.
How would I accomplish this?
A little example using gsub:
s = "a\ncenter![](image.jpg)\nb\n"
puts s.gsub(/^center(.*)$/, "<div class=\"center\">\n\\1\n</div>")
Result is:
a
<div class="center">
![](image.jpg)
</div>
b
Should get you started. The (.*) captures the content after center, and \\1 adds it back into the replacement. In this example I assumed that the item was on a line by itself - ^ indicates the start of a line and $ indicates the end of a line. If that isn't the case, you'll need to work out what makes the lines you want to match unique so that the regex doesn't replace any random usage of "center" in your text.
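As a minimal sketch of the preprocessing script itself (the file names are placeholders, and the pattern is narrowed to the image syntax rather than .* to reduce accidental matches):

# Read a markdown file, wrap lines of the form center![](...) in a centering div,
# and write the result out for pandoc to pick up.
input  = File.read('input.md')
output = input.gsub(/^center(!\[[^\]]*\]\([^)]*\))$/, "<div class=\"center\">\n\\1\n</div>")
File.write('output.md', output)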

How to get node text without children?

I use Nokogiri to parse an HTML page with content like this:
<p class="parent">
Useful text
<br>
<span class="child">Useless text</span>
</p>
When I call the method page.css('p.parent').text Nokogiri returns 'Useful text Useless text'. But I need only 'Useful text'.
How to get node text without children?
XPath includes the text() node test for selecting text nodes, so you could do:
page.xpath('//p[@class="parent"]/text()')
Using XPath to select HTML classes can become quite tricky if the element in question could belong to more than one class, so this might not be ideal.
Fortunately Nokogiri adds the text() selector to CSS, so you can use:
page.css('p.parent > text()')
to get the text nodes that are direct children of p.parent. This will also return some nodes that are whitespace only, so you may have to filter them out.
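For example, a small sketch of the filtering step (assuming the markup from the question):

require 'nokogiri'

html = <<~HTML
  <p class="parent">
    Useful text
    <br>
    <span class="child">Useless text</span>
  </p>
HTML

page = Nokogiri::HTML(html)

# Direct-child text nodes of p.parent, with whitespace-only nodes dropped
texts = page.css('p.parent > text()').map { |t| t.text.strip }.reject(&:empty?)
puts texts.join(' ')   # => "Useful text"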
You should be able to use page.css('p.parent').children.remove.
Then page.css('p.parent').text will return the text without the child nodes.
Note: the page will be modified by the remove.
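If mutating the parsed page is a problem, one variation (a sketch, not from the answer above) is to work on a deep copy of the node and drop only its element children, which leaves the page and the text nodes untouched:

require 'nokogiri'

page = Nokogiri::HTML('<p class="parent">Useful text<br><span class="child">Useless text</span></p>')

# Duplicate the node so the original document is left as-is
node = page.at_css('p.parent').dup
node.element_children.each(&:remove)   # drop <br> and <span>, keep the text nodes
puts node.text.strip                   # => "Useful text"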

Get XPath inner text without some but not all children

I have some HTML like this
<div>
<a>link that I do not want to get</a>
<div>Div that I do not want to get</div>
Text I want to get
<br> I like brs
<b>That text I also want, because I like bold text</b>
<div>I do not want all divs</div>
</div>
And I'd like to use xpath to extract out just the
Text I want to get
<br> I like brs
<b>That text I also want, because I like bold text</b>
In other words, I want all children of the div, but not the a and not the nested div elements.
How can I do this?
You can use self::a to detect a elements, and then use not to exclude them, i.e.:
/div/node()[not(self::a or self::div)]
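For instance, with Nokogiri (assumed here only for illustration) the same expression looks like this; note that the HTML parser wraps the snippet in html/body, so the path starts there:

require 'nokogiri'

html = <<~HTML
  <div>
    <a>link that I do not want to get</a>
    <div>Div that I do not want to get</div>
    Text I want to get
    <br> I like brs
    <b>That text I also want, because I like bold text</b>
    <div>I do not want all divs</div>
  </div>
HTML

doc = Nokogiri::HTML(html)

# Children of the outer div, excluding a and div elements
nodes = doc.xpath('/html/body/div/node()[not(self::a or self::div)]')
puts nodes.to_html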

Convert HTML to plain text and maintain structure/formatting, with ruby

I'd like to convert html to plain text. I don't want to just strip the tags though, I'd like to intelligently retain as much formatting as possible. Inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc.
The input is pretty simple, usually well-formatted html (not entire documents, just a bunch of content, usually with no anchors or images).
I could put together a couple of regexes that get me 80% there, but figured there might be some existing solutions with more intelligence.
First, don't try to use regex for this. The odds are really good you'll come up with a fragile/brittle solution that will break with changes in the HTML or will be very hard to manage and maintain.
You can get part of the way there very quickly using Nokogiri to parse the HTML and extract the text:
require 'nokogiri'
html = '
<html>
<body>
<p>This is
some text.</p>
<p>This is some more text.</p>
<pre>
This is
preformatted
text.
</pre>
</body>
</html>
'
doc = Nokogiri::HTML(html)
puts doc.text
>> This is
>> some text.
>> This is some more text.
>>
>> This is
>> preformatted
>> text.
The reason this works is that Nokogiri returns the text nodes, which are basically the text contained in the tags along with the whitespace surrounding them. If you do a pre-flight cleanup of the HTML using tidy you can sometimes get a lot nicer output.
The problem is when you compare the output of a parser, or any other means of looking at the HTML, with what a browser displays. The browser is concerned with presenting the HTML in as pleasing a way as possible, ignoring the fact that the HTML can be horribly malformed and broken. The parser is not designed to do that.
You can massage the HTML before extracting the content to remove extraneous line breaks, like "\n" and "\r", and then replace <br> tags with line breaks. There are many questions here on SO explaining how to replace tags with something else. I think the Nokogiri site also has that as one of the tutorials.
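As a rough illustration of the br-replacement idea (a sketch only; the exact whitespace handling will depend on your input):

require 'nokogiri'

doc = Nokogiri::HTML('<p>First line<br>Second line</p><p>Another paragraph.</p>')

# Turn each <br> into a literal newline text node before extracting the text
doc.css('br').each { |br| br.replace(Nokogiri::XML::Text.new("\n", doc)) }

# Add a blank line after each paragraph so paragraphs stay separated in the output
doc.css('p').each { |p| p.add_next_sibling(Nokogiri::XML::Text.new("\n\n", doc)) }

puts doc.text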
If you really want to do it right, you'll need to figure out what you want to do for <li> tags inside <ul> and <ol> tags, along with tables.
An alternate attack method would be to capture the output of one of the text browsers like lynx. Several years ago I needed to do text processing for keywords on websites that didn't use Meta-Keyword tags, and found one of the text-browsers that let me grab the rendered output that way. I don't have the source available so I can't check to see which one it was.
