XPath "Not". Ignore branches with a specific tag - xpath

I have loaded a web page into the HTML Agility Pack and have a DOM. I want to use XPath to pull out all of the text on the page (but not the JavaScript found within <script> tags).
I figure I need //text() and then a 'not' to ignore any text in a branch that has a <script> tag in it.
I have tried
doc.DocumentNode.SelectNodes("//text()[not(self::script)]"))
and
doc.DocumentNode.SelectNodes("//text()[not(script)]"))
but neither works. An example of the XPath property of a node that they return is (notice the script element):
/html[1]/body[1]/div[2]/div[4]/div[1]/div[1]/div[1]/div[3]/script[1]/#text[1]
I have consulted with both of these posts.
Is it possible to do 'not' matching in XPath?
Grab all text from html with Html Agility Pack (This is a good post but it brings out the JS)
Any suggestions?

Your first attempt rejects all text nodes that are themselves script elements, and your second rejects all text nodes that have script element children. In both cases the condition can never be true, because a text node is never an element and never has element children, so neither predicate filters anything out.
You haven't explained your requirements clearly, but I guess you want to reject all text nodes that have script elements as their parent, which would be
//text()[not(parent::script)]
or
//*[not(self::script)]/text()
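
For illustration, a minimal HTML Agility Pack sketch using the first expression might look like this (the sample markup is invented; in practice doc would come from HtmlWeb().Load or HtmlDocument.LoadHtml on your page):

using System;
using HtmlAgilityPack;

class VisibleTextSketch
{
    static void Main()
    {
        // Invented sample markup; load your real page instead.
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><p>Visible text</p><script>var hidden = 1;</script></body></html>");

        // Select every text node whose parent is not a <script> element.
        var textNodes = doc.DocumentNode.SelectNodes("//text()[not(parent::script)]");
        if (textNodes != null)
        {
            foreach (var node in textNodes)
            {
                var text = node.InnerText.Trim();
                if (text.Length > 0)
                {
                    Console.WriteLine(text); // prints "Visible text", skips the script body
                }
            }
        }
    }
}

If you also want to skip stylesheet content, the same pattern extends to //text()[not(parent::script or parent::style)].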

Related

XPATH - how to pick up text in each html element regardless of classes

I am trying to grab some content from web pages that are not structured in a uniform fashion. What I want to do is tell the XPath to grab any content within HTML tags in the order it sees them and return the results, without having to specify div names etc., as they are different and not very uniform.
So I need to know how to just say 'return any HTML content in the order that it's found within tags', regardless of whether they are classes, em tags, strong tags, etc. The only experience I have had with XPath is specifying actual div names, for example:
//div[@id='tab_info']
This XPath,
string(/)
will return the string value of the entire XML or HTML document. That is, it'll return a single string of all of the text in document order, as requested.
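
Note that string(/) is a string-valued expression, so it has to be evaluated rather than used to select nodes. Here is a minimal sketch using the plain .NET XPath API; the markup is invented, and a real HTML page would need to be well-formed or tidied first:

using System;
using System.IO;
using System.Xml.XPath;

class StringValueSketch
{
    static void Main()
    {
        // Invented, well-formed sample markup.
        var xml = "<html><body><h1>Title</h1><p>First <em>paragraph</em>.</p></body></html>";

        var document = new XPathDocument(new StringReader(xml));
        var navigator = document.CreateNavigator();

        // string(/) yields the concatenation of every text node in document order.
        var allText = (string)navigator.Evaluate("string(/)");
        Console.WriteLine(allText); // prints "TitleFirst paragraph."
    }
}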

Get element name by containing text

I'm looking through HTML documents for the text: "Required". What I need to find is the element that holds the text. For example:
<p>... Required</p>
I would get element name = p.
However, it might not be in a <p> tag. It could be in any kind of tag, which is where this question differs from some of the other search text Stack Overflow questions.
Right now I'm using:
page.at(':contains("Required")')
but this only gets me the full HTML element.
The problem you have is that the :contains pseudo-class matches any element that has the searched-for text anywhere in its descendants. You need to find the innermost element that contains that text. Since html is the ancestor of all elements, if the page contains the text anywhere then html will contain it, and so that will be the first matching element.
I’m not sure you can achieve this with CSS, but you can use XPath like this:
page.at_xpath('//*[text()[contains(., "Required")]]')
This finds the first element node that has a text() node as a child that contains Required. When you have that node (if it exists) you can then call name on it to give the name of the element.
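
Roughly the same idea in HTML Agility Pack terms, as a sketch with invented markup, would be:

using System;
using HtmlAgilityPack;

class ElementNameSketch
{
    static void Main()
    {
        // Invented sample markup for illustration.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><span>This field is <b>Required</b></span></div>");

        // First element that has a direct text node child containing "Required".
        var node = doc.DocumentNode.SelectSingleNode("//*[text()[contains(., 'Required')]]");
        if (node != null)
        {
            Console.WriteLine(node.Name); // prints "b"
        }
    }
}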
For CSS you can do:
page.at('[text()*="Required"]')
It's not real CSS though, nor even a jQuery extension.
You should use CSS selectors:
page.css('p').text

xpath: find a node whose content has a provided string

I have some HTML like this:
<div> Make </div>
And I want to match it based on the fact that the content of the node contains the text "Make".
Put another way "Make" is a substring of the div node's content and I want to make such a match on this node using XPath.
The obvious solution would be
//div[contains(., 'Make')]
but this will find all divs that contain the string "Make" anywhere within their content, so not only will it find the example you've given in the question but also any ancestor div of that one, or any divs where that substring is buried deep in a descendant element.
If you only want cases where that string is directly inside the div with no other intervening elements then you'd have to use the slightly more complex
//div[text()[contains(., 'Make')]]
This is subtly different from
//div[contains(text(), 'Make')]
which would look only at the first text node child of the div, so it would find <div>Make<br/>Break</div> but not <div>Break<br/>Make</div>.
If you want to allow for intervening elements other than div, then try
//div[contains(., 'Make')][not(.//div[contains(., 'Make')])]
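
A small HTML Agility Pack sketch (invented markup) makes the difference between the two predicate forms visible:

using System;
using HtmlAgilityPack;

class ContainsSketch
{
    static void Main()
    {
        // Invented sample markup for illustration.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='a'>Make<br/>Break</div><div id='b'>Break<br/>Make</div>");

        // contains(text(), ...) converts the text() node-set to a string using only
        // its first text node, so it matches div 'a' but not div 'b'.
        var firstTextOnly = doc.DocumentNode.SelectNodes("//div[contains(text(), 'Make')]");
        Console.WriteLine(firstTextOnly.Count); // prints 1

        // text()[contains(., ...)] tests every text node child, so it matches both divs.
        var anyTextChild = doc.DocumentNode.SelectNodes("//div[text()[contains(., 'Make')]]");
        Console.WriteLine(anyTextChild.Count); // prints 2
    }
}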
Seems like this is what you are looking for: //div[contains(text(),'Make')]
If that does not work, you can try //div[contains(.,'Make')]. This will find all divs whose string value (their concatenated text content) contains 'Make'.
To find that node anywhere in the document, you would need this:
//div[contains(text(), "Make")]

how to use Xpath with LibXml 2

At this address I am trying to scrape a tag (the large price, which is the bold red one).
I am using LibXml 2.2.
When I try to extract the tag with this XPath
//*[@class='priceLarge']
it works!
But to make queries easier I would like to use Firebug in Firefox.
Using Firebug gives me this XPath:
/html/body/div[2]/form/table[3]/tbody/tr/td/div/table/tbody/tr[2]/td[2]/span/b
Using this XPath it does not work; it seems this one does not give a complete query. How can I modify this XPath to scrape the item?
Firefox and other browsers generate tbody tags when they build the DOM, even if none appear in the source HTML.
In fact, the tbody is probably not there, so you can remove it from your XPath (/html/body/div[2]/form/table[3]/tr/td/div/table/tr[2]/td[2]/span/b). You can test this by saving the HTML from your application and viewing it in a text editor.
Since the intent is to pull information from a web page, however, your application will probably be more resistant to changes in the page if you use an XPath expression that is less dependent on the tree structure (i.e. //b[@class='priceLarge']).
EDIT: It seems that in addition to the tbody problem, Firefox is rendering the div element (ID: divsinglecolumnminwidth) as containing the form element (ID: handleBuy).
Looking at the HTML with an XML editor shows that the form element is actually a sibling of that div element, so the expression should start with /html/body/form/table[3].
One tool, among many others, to test your XPath expressions is HAP Testbed.
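
For illustration, here is a minimal HTML Agility Pack sketch of the structure-independent approach; the markup is a made-up stand-in for the real page, and only the priceLarge class name comes from the question:

using System;
using HtmlAgilityPack;

class PriceScrapeSketch
{
    static void Main()
    {
        // Simplified stand-in markup; the real page is far more deeply nested.
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><td><span><b class='priceLarge'>$12.34</b></span></td></tr></table>");

        // Class-based query: keeps working even if wrapper elements are added or removed.
        var price = doc.DocumentNode.SelectSingleNode("//b[@class='priceLarge']");
        Console.WriteLine(price != null ? price.InnerText : "not found"); // prints "$12.34"
    }
}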

XPath Expression

I am new to XPath. I have the HTML source of the web page
http://london.craigslist.co.uk/com/1233708939.html
Now I want to extract the following data from the above page
Full Date
Email - just below the date
I also want to find out whether the button "Reply to this post" exists on the page
http://sfbay.craigslist.org/sfc/w4w/1391399758.html
Can anyone help me write the three XPath expressions for the above three pieces of data?
You don't need to write these yourself, or even figure them out yourself. If you use the Firebug plugin, go to the page, right-click on the element you want, and click 'Inspect Element'; Firebug will pop up the HTML in a viewer at the bottom of your browser. Right-click on the desired element in the HTML viewer and click 'Copy XPath'.
That said, the XPath expression you're looking for (for #3) is:
/html/body/div[4]/form/button
...obtained via the method described above.
I noticed that the DTD for the first link is HTML 4.01 Transitional rather than XHTML, so there's no guarantee that it is a valid XML document, and it may not load correctly in an XML parser. In fact, I see several tags that aren't properly closed (e.g. <hr>).
I don't know the first one off hand, and the third one was just answered by Alex, but the second one is /html/body/a[1].
As for your first page, it's just not possible, because that is not the way XPath works. For an XPath expression to select something, that "something" must be a node (i.e. an element).
The second page is fairly easy, but you need an "id" attribute in order to do that (or anything else that ensures your button is unique). For example, if you are sure the text "Reply to this post" correctly identifies the button, you can just do it with
//button[contains(., "Reply to this post")]
