Xpath: How to select 1st and 2nd number separately in a sentence - xpath

I have a project where I need to parse a webpage for daily updates and print them in my spreadsheet. I am using =importXML().
I only need two numbers from the page, both contained in one sentence (a subheader). Here is an example:
<div class="col-sm-12 text-18 line-height-27">
<h2>Header</h2>
<p class="text-18">
<strong>21 some words 234 another few words</strong>
</p>
<p class="text-18">
Some content ...
</p>
<h2>Header 2</h2>
<p class="text-18">
<strong>12 some words 144 another few words</strong>
</p>
<p class="text-18">
Some old content ...
</p>
//and it goes on and on
</div>
I need to extract only the numbers 21 and 234, separately, each printed in its own cell, where my other spreadsheet functions use them for other tables.
I can select the whole sentence easily with
//div/p[1]/strong
but after that I don't know how to break the sentence down. Is there any way to select only the 1st and 2nd numbers from the sentence?
Can XPath do that? Or would I be better off breaking the sentence down and extracting the numbers with Google Sheets formulas?

You could easily do that with XPath-2.0's fn:replace function (which is not supported by Google).
To achieve that in XPath-1.0, you have to use some tricks. The following is just one approach which heavily depends on the possible values:
concat(substring-before(normalize-space(translate(/div/p[1]/strong,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ,;.:-/','')),' '),' - ',substring-after(normalize-space(translate(/div/p[1]/strong,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ,;.:-/','')),' '))
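This approach deletes all a..z and A..Z characters (and some punctuation) by translating them to nothing. Only the numbers and the whitespace between them remain; normalize-space(...) collapses that whitespace to single spaces, and the numbers are then split with substring-before(...) and substring-after(...).
This is quite complicated, and it doesn't work when the sentence contains characters that are not in the list passed to translate(...).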
In this example, the output is
21 - 234
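To make the three steps easier to follow, here is a rough Python sketch that mirrors them on the sample markup. It does not run the XPath itself (the standard library's ElementTree has no translate() or normalize-space()); the helper name first_two_numbers and the trimmed SNIPPET are made up for illustration.

```python
import string
import xml.etree.ElementTree as ET

# Sample markup from the question, trimmed to the first paragraph (illustrative).
SNIPPET = """
<div class="col-sm-12 text-18 line-height-27">
<h2>Header</h2>
<p class="text-18">
<strong>21 some words 234 another few words</strong>
</p>
</div>
"""

# The same characters the translate() call lists for deletion.
DELETE = string.ascii_letters + ",;.:-/"


def first_two_numbers(sentence):
    """Hypothetical helper mirroring translate / normalize-space / substring-*."""
    # translate(..., DELETE, ''): drop every listed character.
    digits_and_spaces = sentence.translate(str.maketrans("", "", DELETE))
    # normalize-space(): trim and collapse runs of whitespace.
    normalized = " ".join(digits_and_spaces.split())
    # substring-before / substring-after, both relative to the first space.
    first, _, second = normalized.partition(" ")
    return first, second


root = ET.fromstring(SNIPPET)
sentence = root.find("./p[1]/strong").text
first, second = first_two_numbers(sentence)
print(first + " - " + second)  # 21 - 234
```

As with the XPath version, any character that is not in DELETE (a parenthesis, say) stays in the string and ends up in the result.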

This should work assuming there are only letters, numbers and spaces in the sentence; otherwise you will need to adjust the regex. It removes all letters from a to z, case-insensitively, and splits what is left on spaces.
=split(REGEXREPLACE(IMPORTXML(url,xpath_query),"(?i)[a-z]","")," ")

One liner with REGEXEXTRACT (SPLIT is not needed):
=REGEXEXTRACT(IMPORTXML(yourURL;"//div/p[1]/strong");"(\d+).+\s(\d+)")
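If you want to check the pattern outside the spreadsheet, the same expression can be tried with Python's re module. This is only a sanity check on the sample sentence; Google Sheets uses its own regex engine, so treat it as an approximation.

```python
import re

# Sample sentence from the question.
sentence = "21 some words 234 another few words"

# Same pattern as in the REGEXEXTRACT formula: the first run of digits,
# then anything, then whitespace followed by another run of digits.
match = re.search(r"(\d+).+\s(\d+)", sentence)
first, second = match.groups()
print(first, second)  # 21 234
```

Because .+ is greedy, the second group picks up the last number that follows whitespace, so a sentence with three or more numbers would give the first and the last one.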

Related

Possible to run two completely different x-path

Can anyone please help me here?
I want to run two XPath expressions together and store the values; I am not sure if that is possible.
One XPath fetches the city and the second fetches the state:
//div[(text()='city')]/following-sibling::div
//div[contains(text(),'state')]/following-sibling::div
As the XPath expressions show, the name of the city and of the state is in the div that follows the 'city' and 'state' label divs. I want to run both and capture the output as a string.
On a side note: both XPath expressions work fine for me on their own.
<div>
<div>City</div>
<div>London</div>
</div>
<!-- In between: some other elements like p, section, other divs -->
<div>
<div>state</div>
<div>England</div>
</div>
It sounds like you want to convert the results of the two XPath expressions to strings, and concatenate those strings. The expression below concatenates them (with a single space between) using the XPath concat function.
concat(
//div[(text()='city')]/following-sibling::div,
' ',
//div[contains(text(),'state')]/following-sibling::div
)
One other thing: note that in your example XML the text of the first div is "City" rather than "city". Make sure the strings in your XPath expression match the text exactly, because XPath string comparison is case-sensitive: the expression 'City'='city' evaluates to false.
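For comparison, here is a small Python sketch of the same "label div, then value div" lookup done outside XPath, using only the standard library. ElementTree does not support the following-sibling axis, so the helper walks the children by hand; the name value_after_label, the wrapping root element and the case-insensitive match are all assumptions for this example, and it only looks at the immediately following sibling.

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a root element so it parses as XML.
DOC = """
<root>
  <div><div>City</div><div>London</div></div>
  <p>some other elements</p>
  <div><div>state</div><div>England</div></div>
</root>
"""


def value_after_label(root, label):
    """Hypothetical helper: text of the div right after a div whose text is label.

    The comparison ignores case, which sidesteps the City/city mismatch.
    """
    for parent in root.iter("div"):
        children = list(parent)
        for current, following in zip(children, children[1:]):
            is_label = (current.text or "").strip().lower() == label.lower()
            if current.tag == "div" and following.tag == "div" and is_label:
                return (following.text or "").strip()
    return ""


root = ET.fromstring(DOC)
# Same idea as concat(..., ' ', ...): two string values joined by one space.
combined = value_after_label(root, "city") + " " + value_after_label(root, "state")
print(combined)  # London England
```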

xpath how to extract the element itself and one of its child?

I'm fetching data with python requests & xpath.
<div class="test">
<p>pppp</p>
aaa
<em>bbb</em>
ccc
<span>span</span>
</div>
I want to get aaabbbccc.
I tried //div/*[not(self::p) and not(self::span)]//text() to exclude the p and span element, but it only returns bbb.
What is the correct path?
If the element structure is totally predictable and only the content of text nodes varies, then you can use //div/node()[not(self::p|self::span)]/descendant-or-self::text(). Note that this returns a sequence of text nodes, not a single string. This may also return some whitespace text nodes which you may want to filter out with the predicate [normalize-space(.)].
Another possibility would be //text()[not(parent::p|parent::span)].
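Since the question uses Python, here is a sketch of the same filtering with the standard library's ElementTree, which exposes text nodes as .text and .tail instead of text(). The helper name text_excluding is made up, and like the first expression above it only filters direct children of the div.

```python
import xml.etree.ElementTree as ET

# Markup from the question.
div = ET.fromstring(
    '<div class="test">\n<p>pppp</p>\naaa\n<em>bbb</em>\nccc\n<span>span</span>\n</div>'
)


def text_excluding(element, skipped_tags):
    """Hypothetical helper: element text without the text inside skipped children."""
    pieces = [element.text or ""]
    for child in element:
        if child.tag not in skipped_tags:
            # Text inside a kept child, e.g. "bbb" inside <em>.
            pieces.extend(child.itertext())
        # Text after a child belongs to the parent, so it is always kept.
        pieces.append(child.tail or "")
    # Strip each piece, which also drops whitespace-only text nodes.
    return "".join(piece.strip() for piece in pieces)


print(text_excluding(div, {"p", "span"}))  # aaabbbccc
```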

Xpath: why normalize-space could not remove the empty space and \n?

For the following code:
<a class="title" href="the link">
Low price
<strong>computer</strong>
you should not miss
</a>
I used this XPath expression in Scrapy:
response.xpath('.//a[@class="title"]//text()[normalize-space()]').extract()
I got the following result:
u'\n \n Low price ', u'computer', u' you should not miss'
Why were the two \n and the many spaces before "Low price" not removed by normalize-space() in this example?
Another question: how can I combine the 3 parts into one scraped item, u'Low price computer you should not miss'?
Please try this:
'normalize-space(.//a[@class="title"])'
I had the same problem; try this:
[item.strip() for item in response.xpath('.//a[@class="title"]//text()').extract()]
Your call to normalize-space() is in a predicate. That means you are selecting text nodes where (the effective boolean value of) normalize-space() is true. You aren't selecting the result of normalize-space: for that you would want
.//a[@class="title"]//text()/normalize-space()
(which needs XPath 2.0)
The second part of your question: just use
string(.//a[@class="title"])
(assuming scrapy-spider allows you to use an XPath expression that returns a string, rather than one that returns nodes).
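To see what taking the string value and then normalizing the whitespace amounts to, here is a standard-library sketch (no Scrapy involved). Joining the pieces from itertext() plays the role of string(...), and " ".join(text.split()) is a close Python stand-in for normalize-space(); the variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

# Markup from the question.
a = ET.fromstring(
    '<a class="title" href="the link">\n'
    "Low price\n"
    "<strong>computer</strong>\n"
    "you should not miss\n"
    "</a>"
)

# String value of the element: all descendant text nodes concatenated.
string_value = "".join(a.itertext())

# Trim both ends and collapse inner whitespace runs to single spaces.
normalized = " ".join(string_value.split())
print(normalized)  # Low price computer you should not miss
```

Normalizing each text node on its own would give three separate strings, which is why the string value is taken first.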

Regexp to convert tags (similar to BBCode) to HTML

I have set of strings with nested [quote] tags in following format:
[quote name="John"]Some text. [quote name="Piter"]Inner quote.[/quote][/quote]
As you see it is not like ordinary BBCode. So I can't find a suitable regexp for gsub in Ruby to convert them to strings like this:
<blockquote>
<p>Some text.
<blockquote>
<p>Inner quote.</p>
<small>Piter</small>
</blockquote>
</p>
<small>John</small>
</blockquote>
Can anybody please help me with such regexp?
I'm pretty sure that regexes fundamentally can't cope with nesting. What you could do is make it do a minimal match (e.g. only the inner quote levels), replace them, and then repeat as long as you have more matches. Once you've replaced a level it will just be HTML so will not match the regex any more.
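A minimal sketch of that repeat-until-done idea, written in Python rather than Ruby: the pattern only matches a quote whose body contains no further opening tag, so each pass converts the current innermost level. The names INNERMOST_QUOTE, HTML and quotes_to_html are made up, and the output is not indented like the target in the question.

```python
import re

# Matches only an innermost quote: the body may not contain another "[quote ".
INNERMOST_QUOTE = re.compile(
    r'\[quote name="([^"]*)"\]'   # opening tag, captures the name
    r'((?:(?!\[quote ).)*?)'      # body without a nested opening tag
    r'\[/quote\]',                # closing tag
    re.DOTALL,
)
HTML = r"<blockquote><p>\2</p><small>\1</small></blockquote>"


def quotes_to_html(text):
    """Replace innermost quotes repeatedly until no quote tag is left to match."""
    while True:
        text, replaced = INNERMOST_QUOTE.subn(HTML, text)
        if replaced == 0:
            return text


source = '[quote name="John"]Some text. [quote name="Piter"]Inner quote.[/quote][/quote]'
print(quotes_to_html(source))
```

The first pass turns the Piter quote into HTML; after that the John quote no longer contains an opening tag, so the second pass converts it and the third pass finds nothing and stops.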

regex selection

I have a string like this.
<p class='link'>try</p>bla bla</p>
I want to get only <p class='link'>try</p>
I have tried this.
/<p class='link'>[^<\/p>]+<\/p>/
But it doesn't work.
How can I do this?
Thanks,
If that is your string, and you want the text between those p tags, then this should work...
/<p\sclass='link'>(.*?)<\/p>/
The reason yours is not working is that you put <\/p> inside a negated character class. It is not matched as a literal sequence; instead, each of the characters <, /, p and > is excluded individually.
Of course, it is mandatory I mention that there are better tools for parsing HTML fragments (such as an HTML parser).
'/<p[^>]+>([^<]+)<\/p>/'
will get you "try"
It looks like you used this block: [^<\/p>]+ intending to match anything except for </p>. Unfortunately, that's not what it does. A [^...] block matches any single character that is not listed inside, so this one stops at the first <, /, p or >. With the exact sample text it matches try and the closing <\/p> can follow, but as soon as the text between the tags contains one of those characters (a letter p, for instance), the block stops early, the expected </p> does not follow, and there is no match.
Alex's solution, using a non-greedy quantifier, is how I tend to approach this sort of problem.
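Here is a quick way to see the difference, sketched with Python's re module (the patterns are the ones from this thread minus the / delimiters; with_p is a made-up second sample whose text contains the letter p).

```python
import re

sample = "<p class='link'>try</p>bla bla</p>"
with_p = "<p class='link'>help</p>bla bla</p>"  # made-up sample containing "p"

# The negated class excludes the single characters <, /, p and >.
char_class = re.compile(r"<p class='link'>[^</p>]+</p>")
# The lazy version takes as little as possible up to the first </p>.
lazy = re.compile(r"<p\sclass='link'>(.*?)</p>")

print(lazy.search(sample).group(0))  # <p class='link'>try</p>
print(lazy.search(sample).group(1))  # try

# The class stops at the "p" in "help", so </p> cannot follow.
print(char_class.search(with_p))     # None
print(lazy.search(with_p).group(1))  # help
```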
I tried to make one less specific to any particular tag.
(<[^/]+?\s+[^>]*>[^>]*>)
this returns:
<p class='link'>try</p>

Resources