Source website is here on Nethys
Since I don't know all of the terminology, I'm going to keep this as neutral as possible. I'm trying to gather information from this website into separate columns in a Google Sheet.
I want the bold text in one column, the associated link in the next, and the spell description in another. The problem is that when a description references another spell, the site puts that reference in italics, which splits the description into multiple parts, as seen in C153 and C154. I think it would be easier to just grab everything between the bold text and the next line break, but I don't know how to express that.
From an example such as (Forgive me if the formatting is wrong, I'm mostly guessing here);
<p>
<b>
<a href='link1'>
Bold Link 1
</a>
</b>
:Followed by normal text
<br>
<b>
<a href='link2'>
Bold Link 2
</a>
</b>
:Normal Text
<i>with an italic</i>
in between
<br>
<b>
<a href='link3'>
Bold Link 3
</a>
</b>
:Back to this one
<br>
</p>
I can get it to return
:Followed by normal text
Normal text
in between
:Back to this one
But I want it to return :Followed by normal text :Normal text with an italic in between :Back to this one
I don't even know if it's possible to do with a single command but any help would be appreciated.
If you want to select every text node that is a descendant of the p root element but not a descendant of an a element, you could use this XPath:
/p//text()[not(ancestor::a)]
Or, more strictly, using the Kayessian method:
/p//text()[count(.|/p//a//text()) != count(/p//a//text())]
Note: XPath 1.0 has no intersection or set difference operators, but it has union via the | operator and cardinality via the count() function. Dr. Michael Kay discovered that those are enough to test for set membership: an element a is a member of set B if and only if {a} union B has the same cardinality as B. From there you can build all the other set operations.
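For the three-column goal in the question, the same "everything between the bold link and the line break" idea can also be sketched outside XPath. The following is a minimal illustration using Python's standard html.parser module, not something the spreadsheet can run directly. The class name RowParser and the sample markup are assumptions for illustration, and it assumes each label is exactly the text of the a element and each row ends at a br.

```python
from html.parser import HTMLParser

# Sample markup modelled on the question's example (an assumption, not the real page).
HTML = """<p>
<b><a href='link1'>Bold Link 1</a></b>
:Followed by normal text
<br>
<b><a href='link2'>Bold Link 2</a></b>
:Normal Text
<i>with an italic</i>
in between
<br>
<b><a href='link3'>Bold Link 3</a></b>
:Back to this one
<br>
</p>"""


class RowParser(HTMLParser):
    """Collect (label, href, description) rows; each <br> ends a row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_link = False
        self._reset_row()

    def _reset_row(self):
        self.label, self.href, self.desc = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self.href = dict(attrs).get("href")
        elif tag == "br":
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Text inside the link is the label; everything else, including
        # italic text, belongs to the description of the current row.
        (self.label if self._in_link else self.desc).append(data)

    def _flush(self):
        label = " ".join("".join(self.label).split())
        desc = " ".join("".join(self.desc).split())
        if label or desc:
            self.rows.append((label, self.href, desc))
        self._reset_row()

    def close(self):
        super().close()
        self._flush()  # keep a final row that has no trailing <br>


parser = RowParser()
parser.feed(HTML)
parser.close()
for label, href, desc in parser.rows:
    print(label, href, desc, sep=" | ")
```

Because the italic text is appended to the same description buffer as the surrounding text, the second row comes out as one piece instead of three.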
Related
I have an HTML page which contains the following:
<div class="book-info">
The book is <i>Italicized Title</i> by Author McWriter
</div>
When I view this in Chrome Dev Tools, it looks like:
<div class="book-info">
"The book is "
<i>Italicized Title</i>
" by Author McWriter"
</div>
I need a way to find this single div using XPath.
Constraints:
There are many book-info divs on the page, so I can't just look for a div with that class.
Any part of the text within the book-info div might also appear in another, but the complete text within the div is unique. So I want to match the entire text, if possible.
It is not guaranteed that an <i> will exist within the book-info div. The following could also exist, and I need to be able to find it as well (but my code is working for this case):
<div class="book-info">
"Author McWriter's Legacy"
</div>
I think I can detect whether the div I'm looking for contains an <i> or not, and construct a different XPath expression depending on that.
Things I have tried:
//div[text()=concat("The book is ","Italicized Title"," by Author McWriter")]
//div[text()=concat("The book is ","<i>Italicized Title"</i>," by Author McWriter")]
//div[text()=concat("The book is ",[./i[text()="Italicized Title"]," by Author McWriter")]
//div[concat(text()="The book is ", i[text()="Italicized Title"],text()=" by Author McWriter")]
None of these worked for me. What XPath expression would?
You can use this combination of XPath-1.0 predicates in one expression. It matches both cases:
//div[@class="book-info" and ((i and contains(text()[1],"The book is") and contains(text()[2],"by Author McWriter")) or (not(i) and contains(string(.),"Author McWriter's Legacy")))]
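Since the complete text of the div is what is unique, another option is to compare the element's whole string value, which already includes the text inside any i child. In XPath 1.0 that would be something like //div[@class="book-info"][normalize-space(.)="The book is Italicized Title by Author McWriter"]. The sketch below shows the same comparison with Python's standard xml.etree.ElementTree. The PAGE markup and the helper names are assumptions for illustration, and the page is assumed to be well-formed enough to parse as XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the real page.
PAGE = """<html><body>
<div class="book-info">The book is <i>Italicized Title</i> by Author McWriter</div>
<div class="book-info">The book is <i>Another Title</i> by Author McWriter</div>
<div class="book-info">Author McWriter's Legacy</div>
</body></html>"""


def normalized_text(element):
    """Join all descendant text and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())


def find_book_info(root, wanted):
    """Return the book-info divs whose complete text equals `wanted`."""
    return [
        div
        for div in root.iter("div")
        if div.get("class") == "book-info" and normalized_text(div) == wanted
    ]


root = ET.fromstring(PAGE)
with_italics = find_book_info(root, "The book is Italicized Title by Author McWriter")
plain = find_book_info(root, "Author McWriter's Legacy")
print(len(with_italics), len(plain))
```

The same lookup handles both cases, so there is no need to build a different expression depending on whether an i element is present.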
I have the following HTML, and I need to get the text that is outside of the bold tags. For instance, for 'Submitted At:' I need to get the timestamp that follows. You will see that 'Submitted At:' is surrounded by bold tags and the timestamp follows it, and I cannot retrieve it.
<body>
<h2> … </h2>
<b> … </b>
jenkins
<br></br>
<b> … </b>
<br></br>
<b> … </b>
…
<br></br>
<b> … </b>
<br></br>
<b>
Submitted At:
</b>
29-Jan-2016 17:12:24
Things I have tried.
@browser.body.text.split("\n")
@browser.body.split("\n")
body_html = Nokogiri::HTML.parse(@browser.body.html)
body_html.xpath("//body//b").text
returned: "User: JobName: JobConf: Job-ACLs: All users are allowedSubmitted At: Launched At: Finished At: Status: Analyse This Job"
I have tried several things such as xpath, plain old text retrieval, but I am not able to get what I need. I have also done several searches and can't find what I need.
To start with, html bereft of classes and ids is always going to provide a challenge. It is going to be even worse when you want to access text that is merely in the body tag.
In this specific instance, this should work:
browser.b(index: 4)
InnerHtml is literally what its name says: whatever sits between an element's start and end tags. So the timestamp is actually part of the InnerHtml of the outer tag, <body>.
The .text of the <body> tag will give you the entire text. If the tags are going to be dynamic, an index is not going to work. So if you know the timestamp length is always going to be the same, get the entire text, find the string 'Submitted At:', and take the substring that starts right after it and runs for the maximum timestamp length. This will be more stable than a hardcoded index value that may change.
The HTML appears to have a structure of:
a <b> tag that is the field description and
a following text node that is the field value.
Watir can only return the concatenation of all an element's text nodes. As a result, it does not deal well with this structure, which needs the text nodes separated. While you could parse the concatenated String, it could be error prone depending on the possible field descriptions/values.
I would therefore suggest parsing the HTML with Nokogiri as it can return individual text nodes. This would look like:
html = browser.html
doc = Nokogiri::HTML(html)
p doc.at_xpath('//b[normalize-space(text()) = "Submitted At:"]
/following-sibling::text()[1]').text.strip
#=> "29-Jan-2016 17:12:24"
Here we are using an XPath to find the <b> tag that contains the relevant field description, "Submitted At:". From that node, we find the text node, ie the "29-Jan-2016 17:12:24", that comes right after it.
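The same "label in a b tag, value in the text node right after it" structure can be sketched with Python's standard xml.etree.ElementTree, where the text node following an element is exposed as that element's tail. This is only an illustration of the idea, not a replacement for the Nokogiri code above. The SNIPPET markup is a made-up, well-formed stand-in: only the "Submitted At:" field comes from the question, and the "Status:" field and its value are invented.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the page; "Status:" is an invented field.
SNIPPET = """<body>
<b>Submitted At:</b> 29-Jan-2016 17:12:24<br/>
<b>Status:</b> example value<br/>
</body>"""


def field_values(root):
    """Map each <b> label to the text node that directly follows it."""
    values = {}
    for bold in root.iter("b"):
        label = (bold.text or "").strip()
        values[label] = (bold.tail or "").strip()
    return values


fields = field_values(ET.fromstring(SNIPPET))
print(fields["Submitted At:"])
```

Looking values up by label rather than by position keeps working if fields are added or reordered.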
I have this html snippet
<div id="overview">
<strong>some text</strong>
<br/>
some other text
<strong>more text</strong>
TEXT I NEED IS HERE
<div id="sub">...</div>
</div>
How can I get the text I am looking for (shown in caps)?
I tried this, but I get an error message saying the element cannot be located.
"//div[@id='overview']/strong[position()=2]/following-sibling"
I tried this, and I get the div with id=sub, but not the text (correctly so).
"//div[@id='overview']/*[preceding-sibling::strong[position()=2]]"
Is there any way to get the text, other than doing some string matching or regex with the contents of the overview div?
Thanks.
following-sibling is the axis, you still need to specify the actual node (in your example the XPath processor is searching for an element named following-sibling). You separate the axis from the node with ::.
Try this:
//div[@id='overview']/strong[position()=2]/following-sibling::text()[1]
This specifies the first text node after the second strong in the div.
If you always want the text immediately preceding the <div id="sub"> then you could try
//div[@id='sub']/preceding-sibling::text()[1]
That would give you everything between the </strong> and the opening <div ..., i.e. the upper case text plus its leading and trailing new lines and whitespace.
Using Nokogiri, I want to fetch the part of the paragraph that comes after the <span> tags.
I am no regex hero, and it is the only thing that I need to discover before I can move forward. The only constant in the list is the | symbol, and the ugly way is to get the whole thing and split and join it I guess. Hopefully, there is a smarter, more elegant way!
<ul>
<li>
<p>
<strong>I don't care about </strong>
<span>|</span>
this I do care about
</p></li> ...
</ul>
If your HTML is that simple, then this will work:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<ul>
<li>
<p>
<strong>I don't care about </strong>
<span>|</span>
this I do care about
</p></li> ...
</ul>
EOT
doc.at('p').children.last # => #<Nokogiri::XML::Text:0x3ff1995c5b00 "\nthis I do care about\n">
doc.at('p').children.last.text # => "\nthis I do care about\n"
Parsing HTML and XML is really a matter of looking for landmarks that can be used to find what you want. In this case, <span> is an OK landmark, but getting the content you want from it isn't quite as easy as going up one level to the <p> tag, grabbing its children, and selecting the last node in that list, which is the text node containing the text you want.
The reason using the <span> tag is not the way I'd go is, if the HTML formatting changes, the number of nodes between <span> and your desired text could change. Intervening text nodes containing "\n" could be introduced for the formatting of the source, which would mess up a simple indexed lookup. To work around that, the code would have to ignore blank nodes and find the one that wasn't blank.
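The "ignore blank nodes and find the one that wasn't blank" idea can be sketched as follows. This uses Python's standard xml.etree.ElementTree rather than Nokogiri, purely as an illustration. The helper names are assumptions, and the markup is assumed to be well-formed enough to parse as XML.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<ul><li><p>
<strong>I don't care about </strong>
<span>|</span>
this I do care about
</p></li></ul>"""


def direct_text_nodes(element):
    """Yield the element's own text nodes in document order."""
    if element.text is not None:
        yield element.text
    for child in element:
        # In ElementTree, the text after a child element is its tail.
        if child.tail is not None:
            yield child.tail


def last_nonblank_text(element):
    """Return the last text node that is not just whitespace, stripped."""
    nonblank = [t.strip() for t in direct_text_nodes(element) if t.strip()]
    return nonblank[-1] if nonblank else None


paragraph = ET.fromstring(SNIPPET).find(".//p")
print(last_nonblank_text(paragraph))
```

Filtering on content instead of position means extra whitespace-only nodes introduced by reformatting do not change the result.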
I am no regex hero...
And you shouldn't try to be with HTML or XML. They're too flexible and can confound regular expressions unless you're dealing with extremely trivial searches on very static HTML, which isn't very likely in the real internet unless you're scanning abandoned pages. Instead, learn and rely on decent HTML/XML parsers, that can reduce a page into a DOM, making it easy to search and traverse the markup.
I've been playing around with scrapy, and I see that knowledge of XPath is vital in order to use scrapy successfully. I have a webpage I'm trying to gather some information from, where the tags are formatted like this:
<div id = "content">
<h1></h1>
<p></p>
<p></p>
<h1></h1>
<p></p>
<p></p>
Now the heading contains a title, the first 'p' contains data1, and the second 'p' contains data2. This seems like a pretty straightforward task, and if this were always the case I would have no problem, i.e. hsx.select('//*[@id="content"]') etc.
The problem is, sometimes there will only be ONE p tag following a header instead of two.
<div id = "content">
<h1></h1>
<p></p> (a)
<h1></h1>
<p></p> (b)
<p></p> (c)
What I would like is this: if a paragraph tag is missing, I want to store that information as blank data in my list. Right now the lists are storing the first heading, the first paragraph tag (a), and then the paragraph tag under the second h1 (b).
What it should be doing is storing
title -> h1[0]
data1[0] -> (a)
data2[0] ->[]
I hope that makes sense. I've been looking for a good xpath or scrapy solution to do this but I can't seem to find one. Any helpful tips would be awesome. thanks
Use:
//div[@id='content']
/h1[1]/following-sibling::*
[not(position()>2)][self::p]
This selects the (at most) two immediately following sibling elements, only if they are p, of the first h1 child of any div (we know that this must be just one div) whose id attribute has the string value "content".
If only the first immediate sibling is a p, then the returned node-list contains only one item.
You can check whether the length of the returned node-list is 1 or 2, and use this to build the control of your processing.
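The length check can also be turned into a padding step. Below is a minimal sketch of that control logic in plain Python with the standard xml.etree.ElementTree, not the scrapy selector API. The records helper, the fields parameter, and the sample titles are assumptions for illustration, and the markup is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Made-up content mirroring the second layout in the question.
SNIPPET = """<div id="content">
<h1>Title A</h1>
<p>a</p>
<h1>Title B</h1>
<p>b</p>
<p>c</p>
</div>"""


def records(content, fields=2):
    """Group the p elements under each h1, padding missing ones with ''."""
    rows = []
    for child in content:
        if child.tag == "h1":
            rows.append({"title": child.text or "", "data": []})
        elif child.tag == "p" and rows and len(rows[-1]["data"]) < fields:
            rows[-1]["data"].append(child.text or "")
    for row in rows:
        # Pad so every record has exactly `fields` data slots.
        row["data"] += [""] * (fields - len(row["data"]))
    return rows


rows = records(ET.fromstring(SNIPPET))
for row in rows:
    print(row["title"], row["data"])
```

Walking the children in document order keeps each paragraph attached to the heading it follows, so (b) can no longer slide into the first record.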
I think you'd want something like this; not 100% though / untested.
//h1/following-sibling::*[2][self::p]/text()|//h1[not(following-sibling::*[2][self::p])]/string('')