XPath/Scrapy crawling weirdly formatted pages

I've been playing around with Scrapy, and I see that knowledge of XPath is vital in order to use Scrapy successfully. I have a webpage I'm trying to gather some information from, where the tags are formatted like this:
<div id = "content">
<h1></h1>
<p></p>
<p></p>
<h1></h1>
<p></p>
<p></p>
Now the heading contains a title, the first 'p' contains data1, and the second 'p' contains data2. This seems like a pretty straightforward task, and if this were always the case I would have no problem, i.e. hsx.select('//*[@id="content"]') etc.
The problem is, sometimes there will only be ONE p tag following a header instead of two.
<div id = "content">
<h1></h1>
<p></p> (a)
<h1></h1>
<p></p> (b)
<p></p> (c)
What I would like is, if a paragraph tag is missing, to store that information as blank data in my list. Right now the lists store the first h1, the first paragraph tag (a), and then the paragraph tag under the second h1 (b).
What it should be doing is storing
title -> h1[0]
data1[0] -> (a)
data2[0] ->[]
I hope that makes sense. I've been looking for a good xpath or scrapy solution to do this but I can't seem to find one. Any helpful tips would be awesome. thanks

Use:
//div[@id='content']
/h1[1]/following-sibling::*
[not(position()>2)][self::p]
This selects the (at most) two immediate sibling elements, only if they are p, of the first h1 child of any div (we know that this must be just one div) the string value of whose id attribute is "content".
If only the first immediate sibling is a p, then the returned node-list contains only one item.
You can check whether the length of the returned node-list is 1 or 2, and use this to build the control of your processing.
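As a rough sketch of that length check, the snippet below walks the children of the content div and pads a missing second paragraph with an empty string. It uses Python's standard xml.etree.ElementTree rather than Scrapy, assumes well-formed markup, and the sample text and the group_sections helper name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup shaped like the question's second example (the text is made up).
SAMPLE = """
<div id="content">
  <h1>Title A</h1>
  <p>a</p>
  <h1>Title B</h1>
  <p>b</p>
  <p>c</p>
</div>
"""

def group_sections(div, width=2):
    """Return (title, data1, data2) tuples, padding missing <p> values with ''."""
    sections = []
    for child in div:
        if child.tag == "h1":
            # Start a new record for every heading.
            sections.append([(child.text or "").strip()])
        elif child.tag == "p" and sections and len(sections[-1]) <= width:
            # Attach at most `width` paragraphs to the latest heading.
            sections[-1].append((child.text or "").strip())
    # Pad short records so every heading has exactly `width` data slots.
    return [tuple(s + [""] * (width + 1 - len(s))) for s in sections]

root = ET.fromstring(SAMPLE)  # the root element is the content div itself
rows = group_sections(root)
for row in rows:
    print(row)
```

The same padding idea should carry over to Scrapy: run the XPath once per h1, look at how many p nodes come back, and fill the gap with a blank value.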

I think you'd want something like this; not 100% though / untested.
//h1/following-sibling::*[2][self::p]/text()|//h1[not(following-sibling::*[2][self::p])]/string('')

Related

XPath: How do I find a page element which contains another element, using the full text of both?

I have an HTML page which contains the following:
<div class="book-info">
The book is <i>Italicized Title</i> by Author McWriter
</div>
When I view this in Chrome Dev Tools, it looks like:
<div class="book-info">
"The book is "
<i>Italicized Title</i>
" by Author McWriter"
</div>
I need a way to find this single div using XPath.
Constraints:
There are many book-info divs on the page, so I can't just look for a div with that class.
Any part of the text within the book-info div might also appear in another, but the complete text within the div is unique. So I want to match the entire text, if possible.
It is not guaranteed that an <i> will exist within the book-info div. The following could also exist, and I need to be able to find it as well (but my code is working for this case):
<div class="book-info">
"Author McWriter's Legacy"
</div>
I think I can detect whether the div I'm looking for contains an <i> or not, and construct a different XPath expression depending on that.
Things I have tried:
//div[text()=concat("The book is ","Italicized Title"," by Author McWriter")]
//div[text()=concat("The book is ","<i>Italicized Title"</i>," by Author McWriter")]
//div[text()=concat("The book is ",[./i[text()="Italicized Title"]," by Author McWriter")]
//div[concat(text()="The book is ", i[text()="Italicized Title"],text()=" by Author McWriter")]
None of these worked for me. What XPath expression would?
You can use this combination of XPath-1.0 predicates in one expression. It matches both cases:
//div[@class="book-info" and ((i and contains(text()[1],"The book is") and contains(text()[2],"by Author McWriter")) or (not(i) and contains(string(.),"Author McWriter's Legacy")))]
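Since the question says the complete text of the div is unique, another option is to compare the div's whole string value, which includes the text inside any <i>. In XPath 1.0 that could look like //div[@class="book-info" and normalize-space(.)="The book is Italicized Title by Author McWriter"]. The Python sketch below performs the same comparison with the standard library only; the wrapping <body> element and the find_book_info helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Both variants from the question, wrapped in a made-up <body> element.
PAGE = """
<body>
  <div class="book-info">
    The book is <i>Italicized Title</i> by Author McWriter
  </div>
  <div class="book-info">
    Author McWriter's Legacy
  </div>
</body>
"""

def find_book_info(root, wanted):
    """Return book-info divs whose whitespace-normalized full text equals `wanted`."""
    matches = []
    for div in root.iter("div"):
        if div.get("class") != "book-info":
            continue
        # itertext() yields the div's own text and its descendants' text, in order.
        full_text = " ".join("".join(div.itertext()).split())
        if full_text == wanted:
            matches.append(div)
    return matches

root = ET.fromstring(PAGE)
with_italics = find_book_info(root, "The book is Italicized Title by Author McWriter")
plain = find_book_info(root, "Author McWriter's Legacy")
print(len(with_italics), len(plain))
```

Comparing the full string value means one expression covers both the <i> and the plain-text case.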

When adding text() to my XPath, the number of results are duplicated. Why?

The following XPath executed in Chrome's web inspector returns the expected number, 13, of nodes:
//*[@id="day1"]//span[contains(@class, 'day-time-clock')]
However, when I add text() to it:
//*[@id="day1"]//span[contains(@class, 'day-time-clock')]/text()
it returns 26 nodes. However, only every other hit actually points somewhere in the source code, the others are just "numb".
The end node looks like this
<span class="medium bold day-time-clock">
09:00
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some text
</div>
</div>
</span>
The code sample above doesn't show exactly how it looks in the web inspector, there are a couple of empty rows in the text of this node. Here is a small screenshot of how it really looks.
Why is this happening? And what can I do about it?
Your span elements have multiple text node children. Some of the text node children contain only whitespace. In your example, the outer span element has one text node child containing "....09:00...." where "...." represents whitespace, plus one text node child immediately following the child div element. (Incidentally, my HTML is rusty, but I didn't think that having a div inside a span was allowed.)
Your second (inner) span element contains no text nodes, so /text() on this should select nothing.
Generally, using /text() in XPath is a bad idea unless you have some very good reason and know exactly what you are doing.
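To see the two text nodes concretely, here is a small Python sketch using only the standard library. ElementTree exposes an element's text nodes as the element's .text plus each child's .tail; the child_text_nodes helper is a made-up name, and the markup is the span from the question. On the XPath side, appending a predicate such as /text()[normalize-space()] should keep only the text nodes that are not pure whitespace.

```python
import xml.etree.ElementTree as ET

# The span from the question, unchanged.
SPAN = """<span class="medium bold day-time-clock">
09:00
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some text
</div>
</div>
</span>"""

def child_text_nodes(element):
    """Return the element's direct text nodes, roughly what ./text() selects."""
    nodes = [element.text] + [child.tail for child in element]
    return [node for node in nodes if node]

outer = ET.fromstring(SPAN)
all_nodes = child_text_nodes(outer)
# Keep only the nodes that contain something other than whitespace.
useful = [node.strip() for node in all_nodes if node.strip()]
print(len(all_nodes), useful)
```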

How to get number of list element (ul tag) of HTML using Get matching XPath count?

I'm kind of new to XPATH-query. I use RF and selenium2library and the XPath Helper-plugin in chrome to test a certain website page. I'm new to HTML/CSS/JavaScript as well.
The web page consists of two ULs (lists) for left and right sides of the page and each one has a few LIs which have few divisions comprised of widgets (JPEG images etc).
I need to count these list rows (the number of LIs in each UL). I have already done the same thing for a drop-down menu, counting its elements with no problem (perhaps because it was considered a web element). But now the same "Get Matching Xpath Count" returns almost the whole page's HTML source instead of a number, and then fails.
My whole program is based on getting the number of LIs in a UL (of a drop-down menu, page, table, ...), so I wonder what to do now. Here is an example of the HTML code of the page:
<ul class="rqcol" id="col8a580456553ae">
<li class="rqportlet" id="por8a58045655">
<div id="hdrpor8a580" class="rqhdr" onmouseover="RQ.util.showTools(this)" onmouseout="RQ.util.hideTools(this)"> </div> </li>
<li class="rqportlet" id="por8a580456" >
<div id="hdrpor8a581" class="rqhdr" onmouseover="RQ.util.showTools(this)" onmouseout="RQ.util.hideTools(this)"> </div></li>
</ul>
and my code was:
Get Matching Xpath Count | //ul[@id="ccol8a580456553ae"]/li
which gives me some text plus HTML code. I also tried:
Get Length | //ul[@id="ccol8a580456553ae"]
which doesn't give me 2 but a big number.
An XPath 2.0 expression to count the 'li' elements of the specific 'ul' would be:
//ul[@id="col8a580456553ae"]/count(li)
Try this new chrome extension
https://chrome.google.com/webstore/detail/relative-xpath-helper/eanaofphbanknlngejejepmfomkjaiic
You've made a typo in the id value: an extra "c" character at the beginning. Otherwise the XPath is correct:
${count}=    Get Matching Xpath Count    //ul[@id="col8a580456553ae"]/li
By the way, the keyword Get Matching Xpath Count is deprecated in the latest version of the SeleniumLibrary, in favour of Get Element Count
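The effect of the typo can be checked outside Robot Framework. The sketch below is only an illustration with Python's standard library: it uses the question's list markup (event-handler attributes dropped for brevity, and a <body> wrapper added as an assumption) and counts the li children for both the correct and the mistyped id.

```python
import xml.etree.ElementTree as ET

# The list from the question, with the onmouseover/onmouseout attributes omitted.
HTML = """<ul class="rqcol" id="col8a580456553ae">
<li class="rqportlet" id="por8a58045655"><div id="hdrpor8a580" class="rqhdr"> </div></li>
<li class="rqportlet" id="por8a580456"><div id="hdrpor8a581" class="rqhdr"> </div></li>
</ul>"""

root = ET.fromstring("<body>" + HTML + "</body>")

# Correct id: matches both list items.
correct = root.findall(".//ul[@id='col8a580456553ae']/li")
# Mistyped id with the extra leading "c": matches nothing.
typo = root.findall(".//ul[@id='ccol8a580456553ae']/li")
print(len(correct), len(typo))
```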

How do I retrieve innerhtml using watir webdriver

I have the following HTML, and I need to get the text that is outside of the bold tags. For instance, for 'Submitted At:' I need to get the timestamp that follows. You will see that 'Submitted At:' is surrounded by bold tags and the timestamp follows, and I cannot retrieve it.
<body>
<h2> … </h2>
<b> … </b>
jenkins
<br></br>
<b> … </b>
<br></br>
<b> … </b>
…
<br></br>
<b> … </b>
<br></br>
<b>
Submitted At:
</b>
29-Jan-2016 17:12:24
Things I have tried:
@browser.body.text.split("\n")
@browser.body.split("\n")
body_html = Nokogiri::HTML.parse(@browser.body.html)
body_html.xpath("//body//b").text
returned: "User: JobName: JobConf: Job-ACLs: All users are allowedSubmitted At: Launched At: Finished At: Status: Analyse This Job"
I have tried several things such as xpath, plain old text retrieval, but I am not able to get what I need. I have also done several searches and can't find what I need.
To start with, html bereft of classes and ids is always going to provide a challenge. It is going to be even worse when you want to access text that is merely in the body tag.
In this specific instance, this should work:
browser.b(index: 4)
InnerHtml is literally what the name says: whatever sits between an element's start and end tags. So the timestamp is actually part of the InnerHtml of the outer tag, <body>.
The .text of the <body> tag will give you the entire text. If the tags are dynamic, an index is not going to work. So if you know the timestamp length is always the same, get the entire text and take the substring that starts right after 'Submitted At:' and runs for the timestamp length. This will be more stable than a hardcoded index value that may change.
The HTML appears to have a structure of:
a <b> tag that is the field description and
a following text node that is the field value.
Watir can only return the concatenation of all an element's text nodes. As a result, it does not deal well with this structure, which needs the text nodes separated. While you could parse the concatenated String, it could be error prone depending on the possible field descriptions/values.
I would therefore suggest parsing the HTML with Nokogiri as it can return individual text nodes. This would look like:
require 'nokogiri'

html = browser.html
doc = Nokogiri::HTML(html)
# Find the <b> holding the label, then take the first text node after it.
p doc.at_xpath('//b[normalize-space(text()) = "Submitted At:"]
/following-sibling::text()[1]').text.strip
#=> "29-Jan-2016 17:12:24"
Here we are using an XPath to find the <b> tag that contains the relevant field description, "Submitted At:". From that node, we find the text node, ie the "29-Jan-2016 17:12:24", that comes right after it.
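For comparison, the same label-then-following-text lookup can be sketched in Python with the standard library, where the text node right after an element is available as that element's .tail. This is not Watir or Nokogiri code; the field_value helper, the "Other Field:" entry, and its placeholder value are invented for the example, and the markup is assumed to be well formed.

```python
import xml.etree.ElementTree as ET

# Cut-down body: one invented field plus the "Submitted At:" field from the question.
BODY = """<body>
<b>Other Field:</b>
placeholder
<br></br>
<b>
Submitted At:
</b>
29-Jan-2016 17:12:24
</body>"""

def field_value(root, label):
    """Return the text that directly follows the <b> whose text equals `label`."""
    for b in root.iter("b"):
        if (b.text or "").strip() == label:
            # .tail is the text node immediately after the closing </b>.
            return (b.tail or "").strip()
    return None

root = ET.fromstring(BODY)
submitted = field_value(root, "Submitted At:")
print(submitted)
```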

Xpath getting node without node child contents

Hey guys, I couldn't get around this. I have HTML structured as follows:
<div class="review-text">
<div id="reviewerprofile">
<div id="revimg"></div>
<div id="reviewr">marc</div>
<div id="revdate">2011-07-06</div>
</div>
this is an awesome review
</div>
What I am trying to get is just the text "this is an awesome review", but every time I query the node I also get the content of its children. I'm using something like ".//div[@class='review-text']" now. How do I get just that text? Thank you very much.
You're almost there! Just add /text() at the end of your XPath to get the text node.
An XPath expression such as //div returns a set of nodes, in this case div elements. These are in effect pointers to the original nodes in the original tree; the nodes are still connected to their parents, children, ancestors, and siblings. If you see the children of the div element and don't want them, that's not the fault of the XPath processor, it's the fault of whatever software is processing the results returned by the XPath expression.
You can get the text that's an immediate child of the div element by using /text() as suggested. However, that assumes that you know exactly what you are expecting to find in the HTML page - if "awesome" were in italics, it would give you something different.
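The italics caveat can be made concrete with a short Python sketch using only the standard library. The own_text and all_text helpers are made-up names; own_text collects only the text nodes that are immediate children (roughly what /text() selects), while all_text takes the full string value.

```python
import xml.etree.ElementTree as ET

REVIEW = """<div class="review-text">
<div id="reviewerprofile">
<div id="revimg"></div>
<div id="reviewr">marc</div>
<div id="revdate">2011-07-06</div>
</div>
this is an awesome review
</div>"""

# Same review, but with "awesome" in italics (a hypothetical variation).
REVIEW_ITALIC = REVIEW.replace("awesome", "<i>awesome</i>")

def own_text(element):
    """Join only the text nodes that are immediate children of `element`."""
    parts = [element.text or ""] + [child.tail or "" for child in element]
    return " ".join("".join(parts).split())

def all_text(element):
    """Join every text node under `element`, descendants included."""
    return " ".join("".join(element.itertext()).split())

plain = ET.fromstring(REVIEW)
italic = ET.fromstring(REVIEW_ITALIC)
print(own_text(plain))
print(all_text(plain))
print(own_text(italic))
```

With the italic variant, the word inside <i> is no longer an immediate text child, so it drops out of the result.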

Resources