xpath br (line breaks) in p (paragraph) - xpath

Below is an example XML.
<p>
Thisisgood
</p>
<p>
Thisisbad
</p>
<p>
This
<br>
is
<br>
acceptable
</p>
<p>
Thisisfine
</p>
I want the result:
Thisisgood
Thisisbad
Thisisacceptable
Thisisfine
I use Xpath //p/text() in Google Doc (=importXML). This results in:
Thisisgood
Thisisbad
This is acceptable (appearing in different cells)
Thisisfine
What XPath would give me the result I need? Thank you.

You cannot solve this problem using XPath 1.0. Using XPath 2.0, you'd just do a
//p/string-join(text(), '')
but this is not supported by Google Spreadsheet.
I'm pretty sure you can use ARRAYFORMULA and JOIN in Google Spreadsheet, but I cannot help you with that. It would be better to ask a new question with the appropriate Google Spreadsheets tags so that people following them get notified, and to provide an example spreadsheet using the ImportXML function so people can work with it.
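For comparison, here is a rough sketch of what string-join(text(), '') is meant to achieve, written in Python with only the standard library. It assumes the markup is well-formed XML (so <br/> instead of <br>) and wraps it in a hypothetical <root> element; the names SAMPLE and joined_paragraphs are made up for illustration. This is not something importXML can run.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed version of the sample markup (<br/> instead of <br>).
SAMPLE = """<root>
<p>Thisisgood</p>
<p>Thisisbad</p>
<p>This<br/>is<br/>acceptable</p>
<p>Thisisfine</p>
</root>"""


def joined_paragraphs(xml_text):
    """Return one string per <p>, concatenating all of its text nodes."""
    root = ET.fromstring(xml_text)
    # itertext() yields every text node below the element in document order.
    # It also descends into child elements, but <br/> holds no text, so for
    # this sample the result matches joining the direct text nodes of each <p>.
    return ["".join(p.itertext()).strip() for p in root.iter("p")]


print(joined_paragraphs(SAMPLE))
```

Each paragraph comes back as a single string, which is the behaviour the question asks for.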

I had the same problem and used this formula:
=Trim(JOIN("",L3:X3))
where L3:X3 is the range of cells holding the separate text fragments.

//p without text() should be enough to get this: Thisisacceptable

Related

Trouble grabbing background image url with importxml / xpath

I'm trying to scrape some background image urls into a google sheet. Here is an example of the container-
<div class="_rs9 _1xcn">
<div class="_1ue-">
<section class="_4gsw _7of _1ue_" style="background-image: url(https://scontent.x.com/v/t64.5771-25/38974906_464042117451453_1752137156853235712_n.png?_nc_cat=100&_nc_ht=scontent.x.com&oh=c19f15536205be2e1eedb7f7fc7cb61b&oe=5C4442FD)">
<div class="_7p2">
</div>
</section>
I need to extract everything from the https up to the question mark after png. I know there's a way to use substring-before/-after, but I am having a tough time, particularly with escaping quotes.
Here is my attempt. This just gets me an "#N/A":
=IMPORTXML(B2,"substring-before(substring-after(//section[@class='_4gsw _7of _1ue_']/@style, """"background-image: url(""""), """")"""")")
Could anyone help with the full importxml statement? Much appreciated, thanks.
Your approach was close. Try the following XPath expression, which uses single quotes for the string literals so that nothing needs escaping inside the formula's double quotes:
substring-before(substring-after(//section[@class='_4gsw _7of _1ue_']/@style, 'background-image: url('),'?')
The whole expression could look like this:
=IMPORTXML(B2,"substring-before(substring-after(//section[@class='_4gsw _7of _1ue_']/@style, 'background-image: url('),'?')")
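To see what the two nested functions do to the style value, here is a small Python sketch that mimics XPath 1.0's substring-after and substring-before. The helper functions are written just for this illustration, and the style string is a hypothetical, shortened version of the one in the question.

```python
def substring_after(text, sep):
    """Mimic XPath 1.0 substring-after: the text following the first sep, else ''."""
    if sep == "":
        return text  # XPath returns the whole string for an empty separator
    _, found, tail = text.partition(sep)
    return tail if found else ""


def substring_before(text, sep):
    """Mimic XPath 1.0 substring-before: the text preceding the first sep, else ''."""
    if sep == "":
        return ""
    head, found, _ = text.partition(sep)
    return head if found else ""


# Hypothetical style attribute value, shortened from the one in the question.
style = "background-image: url(https://scontent.x.com/v/example_n.png?_nc_cat=100&oe=5C4442FD)"

# Same nesting as the XPath expression: cut off the prefix, then stop at '?'.
url = substring_before(substring_after(style, "background-image: url("), "?")
print(url)
```

The inner call drops everything up to and including "url(", and the outer call keeps only what comes before the first question mark.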

Extracting links (get href values) with certain text with Xpath under a div tag with certain class

SO contributors. I am fully aware of the following question How to obtain href values from a div using xpath?, which basically deals with one part of my problem yet for some reason the solution posted there does not work in my case, so I would kindly ask for help in resolving two related issues. In the example below, I would like to get the href value of the "more" hyperlink (http://www.thestraddler.com/201715/piece2.php), which is under the div tag with content class.
<div class="content">
<h3>Against the Renting of Persons: A conversation with David Ellerman</h3>
[1]
</p>
<p>More here.</p>
</div>
In theory I should be able to extract the links under a div tag with
xidel website -e //div[@class="content"]//a/@href
but for some reason it does not work. How can I resolve this and (2nd part) how can I extract the href value of only the "here" hyperlink?

Selecting with Xpath in Scrapy

I'm using Scrapy to scrape some data from a site and I need help using XPath to select "data" from the following.
<span class="result_item"><span class="text3"><span class="header_text3">**data**</span><br />
**data**<br />
**data**</span> <span class="phone_button_out"><span class="phone_button" style="margin-top: 0"
onclick="pageTracker._trackEvent('USDSearch','Call Now!F');phone_win.open('name','**data**',27101650,0)">
Call Now!<br />
</span></span>
What statements can I use to select the necessary data? I hope this isn't a stupid question. If it is, please point me in the right direction.
There are multiple data elements to get in the posted HTML. Assuming that <span class="result_item"> is the parent of the items, you can try the following.
To get the header (note the // before the inner span, since header_text3 is nested inside the text3 span rather than being a direct child):
//span[@class='result_item']//span[@class='header_text3']/text()
To get anchor link data:
//span[@class='result_item']/a/text()
Also, to help with XPaths, install the Firebug add-on in Firefox, then the FirePath add-on for Firebug. Pointing at an element gives you an autogenerated XPath (good for beginners, though it sometimes needs tuning).

HtmlUnit - getTextContent()

I'm working with HtmlUnit and need to get the text content of an HtmlAnchor, but only its own text, without the text of the HTML tags nested inside it.
<a class="subjectPrice" href="http://www.terra.es/?ca=28_s&st=a&c=4" title="Opel Zafira Tourer 2.0 Cdti 165 Cv Excellence 5p. -12">
<span class="old_price">32.679€</span>
24.395€
If I call htmlAnchor.getTextContent(), it returns 32.679€ 24.395€, but I only need 24.395€.
Can anybody help me? Thanks.
Just use XPath to get the appropriate DomText node. It seems that ./text(), evaluated relative to the HtmlAnchor, should be enough.
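To illustrate which nodes ./text() selects, here is a sketch in Python's standard library rather than HtmlUnit, so it shows the idea and not HtmlUnit's API. The snippet is a hypothetical well-formed version of the anchor (a closing </a> is assumed and the href and title are left out), and direct_text_nodes is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the anchor from the question
# (the closing </a> is assumed; href and title are omitted).
SNIPPET = '<a class="subjectPrice"><span class="old_price">32.679€</span> 24.395€ </a>'


def direct_text_nodes(element):
    """Return the non-empty text nodes that are direct children of element."""
    # ElementTree keeps the text before the first child in .text and the
    # text that follows each child in that child's .tail.
    pieces = [element.text] + [child.tail for child in element]
    return [piece.strip() for piece in pieces if piece and piece.strip()]


anchor = ET.fromstring(SNIPPET)
print(direct_text_nodes(anchor))
```

The old price lives inside the <span>, so it is not a direct text child of the anchor and is left out.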

how to get xpath of text between <br> or <br />?

</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange
Assuming the above, how can I use XPath to grab each fruit? It must be XPath of some sort.
Should I use substring-after(following-sibling...)?
EDIT: I am using the Nokogiri parser.
Well, you could use "//br/text()", but that only returns the text nodes inside the <br> tags. And since the above isn't well-formed XML, I'm not sure how you are going to use XPath on it. Regex is usually a poor choice for HTML, but there are HTML (not XHTML) parsers available. I hesitate to suggest one for Ruby simply because that isn't "my area" and I'd just be googling...
Try the following, which gets all text siblings of <br> tags as an array of strings with leading and trailing whitespace stripped:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(DATA)
# Union of the text nodes before and after any <br>, in document order
fruits =
doc.xpath('//br/following-sibling::text()
| //br/preceding-sibling::text()').map { |fruit| fruit.to_s.strip }
puts fruits
__END__
</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange
Is this what you want?
There are several issues here:
XPath works on XML - you have HTML which is not XML (basically, the tags don't match so an XML parser will throw an exception when you give it that text)
XPath normally also works by finding the attributes inside tags. Seeing as your <br> tags don't actually contain the text, they're just in-between it, this will also prove difficult
Because of this, what you probably want to do is use XPath (or similar) to get the contents of the div, and then split the string based on <br> occurrences.
As you've tagged this question with ruby, I'd suggest looking into hpricot, as it's a really nice and fast HTML (and XML) parsing library, which should be much more useful than mucking around with XPath
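The "get the contents and split on <br>" idea can be sketched as follows. This is Python rather than Ruby, uses only the standard library, and assumes the contents are already available as a plain string; FRAGMENT and split_on_br are names chosen for this illustration.

```python
import re

# The fragment from the question, assumed to be available as a plain string.
FRAGMENT = """apple
<br>
banana
<br/>
watermelon
<br>
orange"""


def split_on_br(html_fragment):
    """Split a fragment on <br>, <br/> or <br /> and drop empty pieces."""
    parts = re.split(r"<br\s*/?>", html_fragment, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]


print(split_on_br(FRAGMENT))
```

This only works when the fragment holds plain text between the <br> tags; anything with nested markup is better handled by a real HTML parser.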
